The article explores the challenges in evaluating clustering algorithms and introduces the Normalised Clustering Accuracy (NCA) as a new external cluster validity measure. NCA addresses the shortcomings of traditional measures by being scale-invariant, bounded, and monotonic. The paper discusses the formal properties of NCA and compares it with existing measures, highlighting its advantages in evaluating clustering algorithms. The authors also provide insights into the practical implications of using NCA and suggest future research directions.
AI Generated
This summary of the content was generated with the help of AI.
Abstract
There is no, nor will there ever be, single best clustering algorithm. Nevertheless, we would still like to be able to distinguish between methods that work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify different aspects of the obtained partitions, e.g., the average degree of cluster compactness or point separability. However, their validity is questionable because the clusterings they endorse can sometimes be meaningless. External measures, on the other hand, compare the algorithms’ outputs to fixed ground truth groupings provided by experts. In this paper, we argue that the commonly used classical partition similarity scores, such as the normalised mutual information, Fowlkes–Mallows, or adjusted Rand index, miss some desirable properties. In particular, they do not identify worst-case scenarios correctly, nor are they easily interpretable. As a consequence, the evaluation of clustering algorithms on diverse benchmark datasets can be difficult. To remedy these issues, we propose and analyse a new measure: a version of the optimal set-matching accuracy, which is normalised, monotonic with respect to some similarity relation, scale-invariant, and corrected for the imbalancedness of cluster sizes (but neither symmetric nor adjusted for chance).
Notes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1 Introduction
Clustering is an unsupervised learning technique that aims at identifying semantically useful partitions of a given dataset (Hennig, 2015; von Luxburg et al., 2012). Up to this date, many clustering algorithms have been proposed, but the problem of how to evaluate the overall quality of the outputs they generate is still open for discussion (Xiong & Li, 2014; Tavakkol et al., 2022; Ullmann et al., 2022; van Mechelen et al., 2023). We know that there will never be a single “best” method (Ackerman et al., 2021; Strobl & Leisch, 2022). Nevertheless, at the very least, we would like to be able to identify the algorithms that are somewhat sensible on certain classes of datasets, or filter out those that consistently yield disappointing results.
Internal validity measures, such as the Caliński–Harabasz, Dunn, or Silhouette index (Caliński & Harabasz, 1974; Dunn, 1974; Rousseeuw, 1987), are often used to quantify how well a given partition reflects the structure of the underlying unknown data distribution, for instance, the degree of compactness or separability (e.g., Milligan and Cooper, 1985; Maulik and Bandyopadhyay, 2002; Halkidi et al., 2001; Arbelaitz et al., 2013; Xu et al., 2020). However, they have been recently criticised by Gagolewski et al. (2021) who noted that some popular measures often promote clusterings that are not meaningful, e.g., they return cluster memberships that resemble noise or should rather be employed as outlier detectors (see Fig. 1 therein).
Advertisement
External validity measures, on the other hand, operate under the assumption that benchmark datasets are equipped with expert-given labels; this is true for many recently introduced test batteries (Graves & Pedrycz, 2010; Thrun & Ultsch, 2020; Dua & Graff, 2022; Fränti & Sieranoja, 2018; Gagolewski, 2022). Moreover, they presume that it would be best if an algorithm returned a clustering that is as similar to the reference one as possible. In the literature, it has become customary to use partition similarity scores as external measures, e.g., the adjusted Rand, normalised mutual information, or pair sets indices (Wagner & Wagner, 2006; Horta & Campello, 2015; Rezaei & Fränti, 2016). However, in this paper, we argue that simpler objects can actually be more suitable.
Let \(X_1,\dots ,X_k\) be a reference (ground truth, indicated by experts) partition of a set X of n objects into k nonempty and pairwise disjoint clusters. Moreover, let \(\hat{X}_1,\dots ,\hat{X}_k\) be a predicted (generated by a clustering algorithm) partition of X into k disjoint clusters. Commonly, the knowledge about which clusters correspond to one another is summarised by a \(k\times k\) confusion matrix \(\textbf{C}=(c_{i,j})\) whose entry in the i-th row and the j-th column gives the number of elements in the i-th true cluster that an algorithm allocated to the j-th predicted cluster, i.e., \(c_{i,j}=\#(X_i \cap \hat{X}_j)\).
Fig. 1
A reference (ground truth; \(k=3\)) partition of an example dataset (WUT/x2; \(n=120\); see Sect. 4) and some predicted clusterings that we would like to relate to it. We also report the confusion matrices and the values of a few external cluster validity measures defined in Sects. 2 and 3. The Gaussian mixture algorithm misclassifies only five points (ca \(4\%\)), but the indices’ values are quite different. In our case, the non-normalised measures (here: CA, FM, and R) do not distinguish between the cases of two undesirable partition types: assigning most points into a single cluster (as returned by single linkage) and memberships assigned at random. Note that there can be many possible reference partitions for a given dataset (Dasgupta & Ng, 2009; Gagolewski, 2022): the clusterings returned both by the Gaussian mixture method and K-Means can be considered meaningful (at least, in the current author’s opinion). However, we are only relating a predicted partition to one reference clustering at a time
In this paper, we are interested in studying real-valued functions aiming at quantifying how similar a predicted partition is to the reference one; compare Fig. 1. For a measure to be useful in the task at hand, it should meet a number of desirable properties (postulates). For the purpose of this introduction, we will now state them only descriptively; the formalism will follow in Sect. 3:
[MON] The more similar the partitions are, the higher the score should be (in our case, we will consider monotonicity with respect to the diagonal max-dominance relation that we define below).
[B1] Only two identical partitions yield the highest possible similarity score, which we conventionally assume to be equal to 1.
[PER] The score does not change if we relabel the cluster IDs (i.e., swapping \(X_i \leftrightarrow X_j\) or \(\hat{X}_i \leftrightarrow \hat{X}_j\) for any \(i\ne j\); this comports with a rearrangement of rows or columns in the corresponding confusion matrix).
The third property stems from the fact that a partition is a set of clusters, and sets (equivalence classes in the equivalence relation “two points belong to the same cluster”) are unordered by definition. From this perspective, “Gaussian mixture” and “Gaussian mixture relabelled” in Fig. 1 represent identical partitions. Overall, clustering is an unsupervised learning problem; therefore, we cannot expect an algorithm to guess the order of reference labels.
Remark 1
Traditional approaches to defining classes of partition similarity scores usually require the fulfilment of the above [MON], [B1], and [PER]. However, additionally (see, e.g., Meilă, 2005; Wagner and Wagner, 2006; Xiong and Li, 2014; Rezaei and Fränti, 2016; Arinik et al. 2021) they assume that:
[SYM] The score is the same if we swap the roles of the two partitions.
In other words, none of the two partitions is treated specially. However, in the context of our paper, we consider \(X_1,\dots ,X_k\) as a fixed point of reference for \(\hat{X_1},\dots ,\hat{X_k}\). Therefore, requiring this condition is not necessary.
Advertisement
Further postulates are related to the worst-case scenarios. When evaluating clustering algorithms on many diverse datasets, we report aggregated similarity scores. It may thus be desirable to have the lower index bound not dependent on, amongst others, the number of clusters k, which may vary across benchmark instances. In other words, we may1 want to have the indices on the same scale: a common choice is the unit interval. And thus, in Sect. 3, we will discuss the following properties.
[B0] The smallest possible value of the index is 0.
[U0] Perfectly uniform assignment of points to predicted clusters, i.e., when \(c_{i,j}=c_{i,i}\) for all i, j, results in the similarity score of 0.
[O0] Assigning all the points to a single cluster gives the similarity score of 0.
We will also relate the above to the adjustment for chance, where the expected value of an index is 0 given two independent partitions having the same marginal frequencies (property [E0]). Nevertheless, we will note that, in general, it contradicts the three above properties.
Moreover, we will also be interested in the following types of scale invariances.
[SU] The similarity score should remain the same when we double, triple, etc., the number of points in each subset \(X_i\cap \hat{X}_j\), without changing the structure of the discovered clusters (the output for \(s\textbf{C}\) should be the same as the one for \(\textbf{C}\) for any \(s>0\)).
[SC] The similarity score should be the same if we multiply points in one reference cluster. In particular, it should not be affected by the imbalancedness of the true cluster sizes, whether some clusters are relatively small or large (the output for \(\textbf{C}\) should be the same as \(\textrm{diag}(s_1,\dots ,s_k)\,\textbf{C}\) for any \(s_1,\dots ,s_k>0\)).
The paper is structured as follows. In the next section, we introduce three basic classes of partition similarity scores. In Sect. 3, we formalise the above properties, study which indices fulfil them, and discuss their various modifications. In particular, we derive a new score called normalised clustering accuracy, which is given by:
where \(c_{i,\cdot }\) is the number of elements in the i-th true cluster.
NCA is the averaged percentage of correctly classified points in each cluster above the perfectly uniform cluster membership assignment. NCA relies on finding the permutation (relabelling) that gives the optimal matching of cluster IDs between the reference and the predicted set, which can be computed in \(O(k^3)\) time. We will prove that it is normalised to the unit interval, monotonic with respect to a particular similarity relation, and corrected for cluster sizes’ imbalancedness.
Further, in Sect. 4, we analyse the degree of association between pairs of indices on an example benchmark dataset battery and inspect how the choice of a similarity measure affects the rankings of a few well-known clustering algorithms. Section 5 sketches possible extensions of the introduced measure to the case of clusterings of different cardinalities.
The implementations of the discussed measures are included in the open-source genieclust (Gagolewski, 2021) package for Python and R.
2 External Cluster Validity Measures
Let \(\textbf{C}=(c_{i,j})\) be a confusion (matching) matrix of size \(k\times k\), whose rows correspond to the reference clusters, and columns summarise the predicted cluster memberships:
For brevity, we denote the total sum by \(n = \sum _{i=1}^k \sum _{j=1}^k c_{i,j}\), and the row- and column-wise sums by \({c}_{i,\cdot } = \sum _{j=1}^k c_{i,j}\) and \({c}_{\cdot ,j} = \sum _{i=1}^k c_{i,j}\).
In the classical (crisp) clustering setting, all \(c_{i,j}\)s are nonnegative integers.2 In such a case, \(c_{i,j}\) denotes the number of points from the i-th reference cluster that an algorithm assigns to the j-th predicted group. Moreover, then n gives the total number of points, \({c}_{i,\cdot }\) is the cardinality of the i-th reference cluster, and \(c_{\cdot ,j}\) is the size of the j-th predicted one. Here, the clusterings can be represented by means of two label vectors \(\textbf{y}=(y_u)\) and \(\hat{\textbf{y}}=(\hat{y}_u)\) of length n, where \(y_u,\hat{y}_u \in \{1,\dots ,k\}\) denote the true and predicted memberships of the u-th point. We thus have \(c_{i,j}=\#\{ u: y_u=i\text { and }\hat{y}_u=j \}\), \({c}_{i,\cdot } = \#\{ u: y_u=i \}\), and \({c}_{\cdot ,j} = \#\{ u: \hat{y}_u=j \}\).
Let us review some of the most seminal classes of crisp partition similarity scores to which we will relate our proposal. More measures are discussed by, amongst others, Xiong and Li (2014); Wagner and Wagner (2006); Rezaei and Fränti (2016); Arinik et al. (2021).
2.1 Counting Concordant and Discordant Point Pairs
The first class of indices is based on counting point pairs that are concordant:
A detailed overview of the behaviour of the indices based on counting object pairs is presented by Warrens and van der Hoef (2022); Horta and Campello (2015); Lei et al. (2017). In the sequel, we will consider various variations of the above indices, e.g., their normalised and adjusted for chance versions.
2.2 Information-Theoretic Measures
Another group of partition similarity scores consists of the so-called information-theoretic measures. As a point of departure for further derivations, let us recall the mutual information score (Horibe, 1985):
with convention \(0\,\log x=0\) for any \(x\ge 0\). Two variants of this index will be presented below; for further discussion, see the papers by Vinh et al. (2010); van der Hoef and Warrens (2019).
2.3 Accuracy-like Set-matching Measures
Let us note that the Rand and the Fowlkes–Mallows scores use \(1/{n \atopwithdelims ()2}\) as the unit of information. Sometimes, we might prefer working on the 1/n-based scale. However, it would be a mistake to rely on the standard accuracy as known from the evaluation of classification models:
which is the proportion of correctly classified points. This measure should not be used in clustering because clusters are defined up to a permutation of the sets’ IDs (compare the desired property [PER]).
In our context, the predicted clusters need to be matched with the true ones somehow. For instance, the confusion matrix corresponding to the label vectors \(\textbf{y} =(1, 2, 2, 1, 2, 3, 1, 1, 1)\) and \(\hat{\textbf{y}}=(3, 1, 1, 3, 1, 2, 3, 3, 3)\),
represents a perfect match. Hence, in this case, we could use \((c_{2,1}+c_{3,2}+c_{1,3})/n\) as a measure of accuracy. Consequently, we need an algorithmic way to translate between the predicted and reference cluster IDs. The simplest choice involves greedy pairing:
These measures are sometimes referred to as purity. Unfortunately, there is no guarantee that all clusters will welcome a match.
If we want to remedy this issue, we should seek a one-to-one correspondence between the cluster IDs, \(\sigma \), which is a solution to the following optimisation problem:
where \(\mathfrak {S}_k\) is the set of all permutations of the set \(\{1,\dots ,k\}\), i.e., bijections from \(\{1,\dots ,k\}\) to itself.
The above guarantees that each column is paired with one and only one row in the confusion matrix. Such optimal pairing leads to, what we call here, the pivoted accuracy (Meilă and Heckerman, 2001, Eq. (13); Steinley, 2004, Eq. (10); Charon et al., 2006; Chacón, 2021):
It relies on the best matching of the labels in the reference partition to the labels in the predicted grouping so as to maximise the standard accuracy. At first glance, it is a very attractive measure because of its interpretability (a feature whose importance was already noted by Goodman and Kruskal (1979)). However, soon, we will note that its values for the worst possible clusterings depend on the number of clusters k.
Remark 2
Eq. 11 can be expressed as the following 0–1 integer linear programming problem:
such that \(\textbf{B}=(b_{i,j})\) is a binary matrix with one and only one value 1 in every row and in every column, i.e., \(b_{i, \cdot } = 1\) and \(b_{\cdot , j}=1\) for all i, j. It is called the maximal linear sum assignment problem or maximum bipartite graph matching. It can be solved using, e.g., the so-called Hungarian algorithm, which requires \(O(k^3)\) time (Crouse, 2016).
Remark 3
Note that optimal matching is not the same as the greedy recursive pairing, where we pick the largest element, and then continue the same procedure in the yet-to-be-selected parts of the matrix. For example, given:
the optimal permutation is \(\sigma =(1, 3, 2)\), yielding the sum \(50+39+39 = 128\), whereas the greedy recursive pairing gives \(\sigma =(1, 2, 3)\), corresponding to \(50+40+22=112\).
In what follows, we will discuss many modifications of the aforementioned indices.
3 Desirable Properties and Features of Indices
Let \(\mathfrak {C}^{k\times k}\) denote the set of admissible confusion matrices of size \(k\times k\) such that if \(\textbf{C}=(c_{i,j})\in \mathfrak {C}^{k\times k}\), then \(c_{i,j}\ge 0\) and \(c_{i,\cdot }>0\) for all i, j. Note that the case where the confusion matrix is non-square will be mentioned separately later (Sect. 5).
Even though we assume that our reference partitions are crisp, in general, we do not want to restrict ourselves to classical (crisp) clustering only. Thus, whether we allow \(c_{i,j}\)s to be arbitrary nonnegative real numbers (not just integer ones), will depend on the index. This will not only simplify the further analysis, but also allow us to accommodate, amongst others, the case of weighted (fuzzy) predicted clusterings, where point memberships are described by probability vectors; compare, e.g., the papers by Hüllermeier et al. (2012); Campagner et al. (2023); Andrews et al. (2022); Flynt et al. (2019); D’Ambrosio et al. (2021); Horta and Campello (2015). We will see that under the scale invariance [SU] property discussed below, this will come without loss in generality.
Additionally, we assume that none of the reference clusters is empty. However, we actually allow a clustering algorithm to return a partition of lower cardinality than k, which is represented by the case of some \(c_{\cdot ,j}\)’s being equal to 0.
In Sect. 1, we outlined a few desirable properties of cluster validity measures in a rather informal manner. Let us formalise them now so that we can introduce various adjustments to the indices. For brevity, all the definitions below should be read as “We say that an index \(\textrm{I}\) meets Property [X], whenever for any \(k\ge 2\) and...”. A summary will be given in Table 2.
3.1 Permutation Invariance
For any \(\sigma \in \mathfrak {S}_k\), denote by \(\textbf{P}_\sigma =(p_{i,j})\) the corresponding permutation matrix, i.e., a \(k\times k\) binary matrix with \(p_{i,j}=1\) if and only if \(i=\sigma (j)\). We stated that clusters are defined up to a permutation of cluster IDs. Therefore, rearranging rows or columns of a confusion matrix should not affect the index value.
Definition 1
(Property [PER]) For all \(\textbf{C}\in \mathfrak {C}^{k\times k}\) and every permutation \(\sigma \in \mathfrak {S}_k\), we have \(\textrm{I}(\textbf{C})=\textrm{I}(\textbf{P}_\sigma \textbf{C})=\textrm{I}(\textbf{C} \textbf{P}_\sigma )\).
As the standard classification accuracy \(\ddot{\textrm{A}}\) (7) is the only index amongst the ones considered in this paper that does not fulfil this property, we will no longer be considering it a viable cluster validity measure candidate.
3.2 Symmetry
In traditional partition similarity scores, both partitions should be treated equally. This is reflected by the symmetry property.
Definition 2
(Property [SYM]) For all \(\textbf{C}\in \mathfrak {C}^{k\times k}\), we have \(\textrm{I}(\textbf{C})=\textrm{I}(\textbf{C}^T)\).
Disregarding from now on the two versions of purity \(\ddot{\textrm{P}}\) and \(\ddot{\textrm{P}}^T\), which are semantically problematic, all the aforementioned scores enjoy this condition. However, we have already argued that in the context of external cluster validation, [SYM] can be omitted. For a given benchmark problem, the reference partition is fixed. Thus, we treat it differently from the predicted ones, because the latter vary across the algorithms under scrutiny.
3.3 Scale Invariance
Another property we might find useful is that any scaling of the confusion matrix should not change the similarity assessment. Scaling can be interpreted as adding or removing new points to the detected clusters without disturbing the discovered structure while maintaining the proportions of cluster sizes. In other words, if we have 50% points correctly classified, whether this was achieved for \(n=100\) or \(n=10{,}000\) should not matter.
Definition 3
(Property [SU]) For all \(\textbf{C}\in \mathfrak {C}^{k\times k}\) and every \(s>0\), we have \(\textrm{I}(s\textbf{C})=\textrm{I}(\textbf{C})\).
Amongst the indices studied so far, only those based on counting concordant/discordant point pairs do not enjoy this property.
Example 1
For instance, for \(\textbf{C}\) given by Eq. 14, we have \(\textrm{FM}(\textbf{C})\approx 0.35297\) but \(\textrm{FM}(3\textbf{C})\approx 0.35727\), and \(\textrm{R}(\textbf{C})\approx 0.56928\) but \(\textrm{R}(3\textbf{C})\approx 0.57023\). The difference becomes even more significant when we consider non-integer confusion matrices. Assuming that we generalise \({x \atopwithdelims ()2}\) for arbitrary reals as \(x(x-1)/2\), we get, e.g., \(\textrm{FM}(\frac{1}{900} \textbf{C})\approx -1.08054\) and \(\textrm{R}(\frac{1}{900} \textbf{C})\approx 1.21464\). This is why, in the context of \(\textrm{R}\) and \(\textrm{FM}\) scores, the study of their properties will be limited to integer matrices only.
However, we can consider \(\textrm{R}'(\textbf{C})=\lim _{s\rightarrow \infty } \textrm{R}(s\textbf{C})\) and \(\textrm{FM}'(\textbf{C})=\lim _{s\rightarrow \infty } \textrm{FM}(s\textbf{C})\). Noting that in Eqs. 3 and 5, if the total sum of \(\textbf{C}\) is n, then the sum of \(s\textbf{C}\) is sn, these limits are given by:
The above arise by replacing \({x \atopwithdelims ()2}\) with \(x^2/2\) in Eqs. 2 and 4. Therefore, the price for the fulfilment of the scale invariance property is the loss of the original interpretation related to the counting of concordant pairs.
3.4 Upper Bound
From now on, let \(\mathfrak {C}^{k\times k}_{s_1,\dots ,s_k}\subseteq \mathfrak {C}^{k\times k}\) denote the set of confusion matrices like \(\textbf{C}=(c_{i,j})\) whose i-th row’s sum \(c_{i,\cdot }\) is equal to \(s_i\), for all \(i=1,\dots ,k\).
In order for an index value to be interpretable, we need to calibrate it so that we know which scores indicate low and high-quality outcomes. In the literature, it is customary to assign the value of 1 when there is a perfect match, and preferably only then. In clustering problems, it is the case when each predicted set can be mapped to precisely one reference cluster, and vice versa. This is represented by confusion matrices with 0 s everywhere but on the diagonal or (by [PER]) their permuted versions; compare, e.g., Eq. 8.
Definition 4
(Property [B1]) For all \(s_1,\dots ,s_k>0\) and every \(\textbf{C}\in \mathfrak {C}^{k\times k}_{s_1,\dots ,s_k}\), we have \(\textrm{I}(\textbf{C})\le 1\). Moreover, \(\textrm{I}(\textbf{C})=1\) if and only if there exists a permutation \(\sigma \in \mathfrak {S}_k\) such that \(\textbf{C}=\textbf{P}_\sigma \textbf{S}\), where \(\textbf{S}=\textrm{diag}(s_1,\dots ,s_k)\).
The Rand and Fowlkes-Mallows indices enjoy [B1] when restricted to the case of integer matrices. Mutual information is not normalised; e.g., for a matrix like \(\textbf{I}=\textrm{diag}(s,\dots ,s)\), we have \(\textrm{MI}(\textbf{I})=\log k\) for all \(s>0\). Other indices discussed so far fulfil this property.
Example 2
The MI score can be rescaled so as to meet [B1]. For instance, the normalised mutual information score denoted by \(I/D_2\) by Kvalseth (1987) and \(\textrm{NMI}_\textrm{sum}\) by Vinh et al. (2010) is given by:
In the context of comparing partitions, some statistical literature finds it desirable if two groupings generated independently at random are assigned a similarity score of 0 on average (Hubert & Arabie, 1985; Vinh et al., 2010).
For instance, under the hypergeometric model for randomness discussed by Fowlkes and Mallows (1983) (for possible alternatives, see Steinley (2004) and Gates and Ahn (2017)), the two partitions are assumed to be picked independently at random subject to having the original n, k, and counts of objects in each cluster, i.e., \(c_{1,\cdot },\dots ,c_{k,\cdot }\), and \(c_{\cdot ,1},\dots ,c_{\cdot ,k}\). Given this assumption, the p-th raw moment is \(\mathbb {E}\, c_{i,j}^p = {c_{i,\cdot }^p\,c_{\cdot ,j}^p}/{n^p}\). We can thus introduce the following property.
Definition 5
(Property [E0]) For all \(s_1,\dots ,s_k>0\), \(t_1,\dots ,t_k>0\), and the random variable \(\textbf{C}\in \mathfrak {C}^{k\times k}\) generated from the hypergeometric model having \(c_{i,\cdot }=s_i\) and \(c_{\cdot ,j}=t_j\) for all i, j, we have \(\mathbb {E}\,\textrm{I}(\textbf{C})=0\).
Given any index \(\textrm{I}\), its adjusted-for chance version can be constructed by taking:
where \(\bar{\textrm{I}}\) is the maximal index value (e.g., 1 for indices fulfilling [B1]) and \(\widetilde{\textrm{I}}\) is the expected index given the assumed randomness model. This way, the maximal value of \(\textrm{AI}\) is 1 and its expectation is 0 when two partitions are unrelated.
Note that Eqs. 20 and 21 are very similar: the difference is that in the former, we have the arithmetic mean in the denominator, whilst in the latter, we see the geometric mean of two terms. In practice, the two indices behave very similarly (see Sect. 4).
Example 4
An adjusted version of the NMI score was studied by Vinh et al. (2010) (who denoted it by \(\textrm{AMI}_\textrm{sum}\)). By noting that:
being a formula we will not be expanding further because of its complexity.
None of the three aforementioned adjusted indices are scale-invariant. By construction, they are defined only for integer confusion matrices.
3.6 Lower Bounds and Normalisation of Indices
An index that is adjusted for chance has the expected value of 0 for partitions generated at random from an assumed model of randomness. In order for that to be possible, it must take negative values in cases “worse than average when picked at random”.
Example 5
Canvass the two following matrices with the same corresponding row sums:
From the current paper’s perspective, the former case is not as undesirable as the latter, where the predicted cluster memberships in each true cluster are assigned uniformly. The adjusted Rand index indicates this correctly: it yields \(\textrm{AR}(\textbf{C})=0\) and \(\textrm{AR}(\textbf{U})\approx -0.019\). However, we argue that a negative index value might be difficult to interpret,3 especially if its lower bound depends on the scale of the confusion matrix.
Instead of looking from a statistical viewpoint, we can take an algebraic perspective, where bounding the index from below, e.g., by 0 could be more informative4 (Charon et al., 2006; Chacón & Rastrojo, 2023). This is beneficial in the case where we run a clustering algorithm on many benchmark datasets and thus obtain numerous similarity scores that should be aggregated into a single number (e.g., by considering sample quantiles or the arithmetic mean). In such a way, we bring all the indices to the same range, e.g., [0, 1], which is dependent on neither k nor n. If the minimum is difficult to obtain or is uninformative, the value of 0 should be attained by confusion matrices that we identify as undesirable outcomes.
We proclaim that there are two worst-case outcomes of a clustering algorithm when its results are compared with a fixed reference partition. The first scenario is where the elements in each row of a confusion matrix are equal to each other:
It represents predicted partitions where all the points are allocated to the same (one6) cluster, assuming given true cluster sizes \(s_1,\dots ,s_k>0\).
These two cases lead us to the following desirable properties.
Definition 6
(Property [U0]) For all \(s_1,\dots ,s_k > 0\), we have \(\textrm{I}(\textbf{U}_{s_1,\dots ,s_k}^{k\times k})=0\).
Definition 7
(Property [O0]) For all \(s_1,\dots ,s_k > 0\), we have \(\textrm{I}(\textbf{O}_{s_1,\dots ,s_k}^{k\times k})=0\).
In Table 1, we have summarised the index values for matrices given by Eqs. 24 and 25. The derivations are quite elementary; hence, they were omitted. Based on these results, we can imply that out of the indices considered so far, only MI and NMI fulfil [U0]. On the other hand, [O0] is true for MI, NMI, AR, AFM, and AMI.
Table 1
Values of indices for confusion matrices given by Eqs. 24 (uniform assignment; see property [U0]) and 25 (all points assigned to a single cluster; see [O0]), where \(S^2=\sum _{i=1}^k s_i^2\). The R, FM, AR, AFM, and AMI indices assume that input matrices are integer
The exact formulae denoted by standalone “\(<0\)”s were omitted due to their complexity (they are all negative and approach 0 as \(n\rightarrow \infty \))
From Table 1, we see that for matrices like \(\textbf{U}\) (24), AR, AFM, and AMI actually take negative values; Chacón and Rastrojo (2023) give the formula for the minimum of AR. Furthermore, R\('\), FM\('\), and A are bounded from below by 1/k; see Appendix A.2 for a proof in the case of the A index. Thus, that the lowest possible value of an index is equal to 0 must be introduced as a separate property.
Definition 8
(Property [B0]) For all \(s_1,\dots ,s_k>0\), we have that \(\min _{\textbf{C}\in \mathfrak {C}^{k\times k}_{s_1,\dots ,s_k}} \textrm{I}(\textbf{C})=0\).
Out of the indices considered so far, only MI and NMI fulfil [B0]. Note that even if an index fulfils both [U0] and [O0], there is no guarantee that it is bounded from below by 0.
Example 6
Consider the limiting version of AR (which is a formula equivalent to the one proposed by Morey and Agresti (1984)),
we have \(\textrm{NR}'(\mathbf {\Omega })=\textrm{NFM}'(\mathbf {\Omega })=-\frac{1}{15}.\)
If we are able to identify the minimum of an index \(\textrm{I}\) over all matrices with given row sums, denoted by \(\underline{\textrm{I}}\), we can introduce its normalised version by taking:
where again \(\bar{\textrm{I}}\) is the maximal index value. This way we obtain the index that has the minimum value of 0 and maximum of 1.
Example 7
For all \(s_1,\dots ,s_k>0\), we have \(\min _{\textbf{C}\in \mathfrak {C}^{k\times k}_{s_1,\dots ,s_k}} \textrm{A}(\textbf{C})=1/k\); see Appendix A.2.7 We can thus introduce the normalised variant of the pivoted accuracy:
NA is an example of an index that fulfils [U0] and [B0], but not [O0] (compare Table 1).
3.7 Invariance to Cluster Sizes
Here is a more restrictive version of scale invariance: we posit that increasing the number of points in any reference cluster and assigning the new points to the predicted sets without disturbing the proportions of allocations should not change a given cluster validity index; compare the paper by Rezaei and Fränti (2016), who studied a similar postulate. For instance, if 50% of the points are correctly classified in the first reference cluster, then it should not matter whether its cardinality is, say, \(c_{1,\cdot }=100\) or \(c_{1,\cdot }=10{,}000\).
Definition 9
(Property [SC]) For all \(\textbf{C}\in \mathfrak {C}^{k\times k}\) and every \(s_1,\dots ,s_k>0\), we have \(\textrm{I}(\textbf{S} \textbf{C})=\textrm{I}(\textbf{C})\), where \(\textbf{S}=\textrm{diag}(s_1,\dots ,s_k)\).
Note that \(\textbf{S} \textbf{C}\) is a version of the original matrix whose i-th row is multiplied by \(s_i\).
If an index satisfies [SC], we will say that it is invariant to inequalities in the cluster sizes. Clearly, [SC] implies [SU]. Unfortunately, none of the aforementioned indices fulfils [SC]. However, if an index \(\textrm{I}\) satisfies [SU], it is natural to consider its corrected-for-cluster-sizes version \(\textrm{CI}\) given for any \(\textbf{C}\in \mathfrak {C}^{k\times k}\) by \(\textrm{CI}(\textbf{C})=\textrm{I}(\textbf{S}\textbf{C})\), where \(\textbf{S}=\textrm{diag}(1/c_{1,\cdot },\dots ,1/c_{k,\cdot })\).
where \(\textbf{S}=\textrm{diag}(1/c_{1,\cdot },\dots ,1/c_{k,\cdot })\).
They have all gained [SC] at the cost of losing [SYM]. Interestingly, compared with their original counterparts, NCR\('\) and NCFM\('\) now additionally fulfil the previously missing [B0] (see Appendix A.1 for the proof).
Example 9
Using the same transformation, but this time applied on the pivoted accuracy given by Eq. 12, we can also introduce the clustering accuracy:
It is the average of the proportions of correctly classified points in each true cluster.
Example 10
As \(\min _{\textbf{C}\in \mathcal {C}_{s_1,\dots ,s_k}^{k\times k}} \textrm{CA}(\textbf{C})=1/k\) (see Appendix A.2), we can finally arrive at the new normalised clustering accuracy that we announced in the introduction, which is the normalised version of CA:
NCA fulfils [PER], [SU], [SC], [B1], [U0], [O0], and [B0].
Remark 4
On a side note, a different type of scaling was considered by Rezaei and Fränti (2016, Sec. 6.1). It is based on the Braun-Blanuqet similarity coefficient (Braun-Blanquet, 1932) and leads to the index given by:
It fulfils [U0] and [O0], but fails to meet [B0]; e.g., for the matrix given by Eq. 28, we get \(\textrm{NBA}(\mathbf {\Omega })=-1/3\). Interestingly, the authors overcome this limitation by simple clipping, defining the pair sets index as:
It is not rare in the literature to seek functions like \(f:Z\rightarrow \mathbb {R}\), which preserve a certain partial order \(\preceq \) on a set Z, i.e., expect that \(f(z)\le f(z')\) whenever \(z\preceq z'\). For instance, generalised means (e.g., the arithmetic mean and the median) are nondecreasing in each variable (Bullen, 2003, p. xxvi; Grabisch et al., 2009, Chap. 4), and economic inequality indices (e.g., the Gini or Bonferroni ones) preserve the majorisation relation (are Schur convex, satisfy the principle of progressive transfers; Arnold, 2015, Chap. 4).
In the literature, the monotonicity property of clustering similarity indices is often studied in the context of merging or splitting clusters; see the overview by Arinik et al. (2021). Inspired by an empirical sensitivity analysis by Rezaei and Fränti (2016, Sec. 7.3), we would now like to propose one of the possible ways to formalise the property descriptively formulated as “the more similar the predicted partition is to the reference one, the higher the score should be”.
Let us define a class of confusion matrices where the maximal elements in each row lie on the main diagonal.
Definition 10
We call a matrix \(\textbf{C}\in \mathfrak {C}^{k\times k}\)diagonally max-dominant, whenever for all i, j, we have \(c_{i,i} \ge c_{i, j}\).
Fig. 2
Change of index values \(\textrm{I}(\textbf{C}')-\textrm{I}(\textbf{C})\) for randomly generated diagonally max-dominant matrices \(\textbf{C}\) with \(k=4\), \(c_{1,\cdot }=c_{2,\cdot }=c_{3,\cdot }=100\), and \(c_{4,\cdot }=700\), where \(\textbf{C}'\) is the same as \(\textbf{C}\) except that \(c'_{1,1}=c_{1,1}+1\) and \(c'_{1,2}=c_{1,2}-1\) (one point’s predicted cluster membership changed). Ideally, an index’s response to such an improvement in clustering accuracy should be nonnegative (property [MON]). Amongst the indices studied, only A, CA, NA, and NCA guarantee this
In our context, such matrices represent the case where there is no doubt as to which predicted cluster corresponds to which reference one. We proclaim that, then, correcting the belongingness of a misclassified point (moving it from cluster j to cluster i, when it indeed belongs to i), should, at the very least, not result in a decrease of an external cluster validity measure.
Let the partial order \(\preceq _{\text {DMD}}\) on \(\textbf{C}\in \mathfrak {C}^{k,k}\) be defined in such a way that \(\textbf{C}\preceq _{\text {DMD}}\textbf{C}'\) if and only if \(\textbf{C}\) and \(\textbf{C}'\) have identical row sums (\(c_{i,\cdot }=c_{i,\cdot }'\), \(i=1,\dots ,k\)), are diagonally max-dominant, and \(c_{i,i}\le c_{i,i}'\) for all i.
Definition 11
(Property [MON]) If \(\textbf{C}\preceq _{\text {DMD}}\textbf{C}'\), then \(\textrm{I}(\textbf{C})\le \textrm{I}(\textbf{C}')\).
When \(\textrm{I}\) satisfies [MON], then for all diagonally max-dominant \(\textbf{C}\in \mathfrak {C}^{k,k}\), every \(i\ne j\), and \(c_{i,j}\ge t>0\), we have \(\textrm{I}(\textbf{C})\le \textrm{I}(\textbf{C}')\), where \(\textbf{C}'\) is a version of \(\textbf{C}\) except \(c_{i,i}'=c_{i,i}+t\) and \(c_{i,j}'=c_{i,j}-t\).
We can define strict monotonicity similarly, by replacing \(\le \) and \(\preceq \) with < and \(\prec \)(i.e., \(\preceq \) and not \(=\)), respectively, in the above definition.
For a diagonally max-dominant matrix \(\textbf{C}\), we have \(\max _{\sigma \in \mathfrak {S}_k} \sum _{i=1}^k c_{i,\sigma (i)} = \sum _{i=1}^k c_{i,i}\) and \(\max _{\sigma \in \mathfrak {S}_k} \sum _{i=1}^k c_{i,\sigma (i)}/c_{i,\cdot } = \sum _{i=1}^k c_{i,i}/c_{i,\cdot }\). Therefore, it is easily seen that A, NA, CA, and NCA are naturally strictly monotonic. Moreover, improving the memberships in the i-th true cluster always results in the same change of the index: they increase linearly (the increment depends on \(c_{i,\cdot }\) for CA and NCA).
Let us now render some counterexamples showing that [MON] does not hold for other indices studied herein.
R, FM, AR, AFM, AMI assume that input matrices are integer. Our correction for cluster sizes that guarantees [SC] results in the violation of [SYM]. Note that [E0] and [B0] are mutually exclusive
Example 11
For any integer \(s_1,\dots ,s_k \ge 1\), let \(\textbf{C}\in \mathfrak {C}^{k\times k}\) be a randomly generated matrix whose i-th row is obtained using the following procedure:
Generate \((u_1,\dots ,u_k)\sim \textrm{Dir}(1, \dots , 1)\) (a sample from a Dirichlet distribution).
Let \(c_{i,j}=\max \{1, \lfloor u_j s_i \rfloor \}\) for \(j=2,\dots ,k\) and then \(c_{i,1}=s_i-(c_{i,2}+\dots +c_{i,k})\).
This guarantees that \(\textbf{C}\) is diagonally max-dominant, has positive integer elements, and its consecutive row sums are \(s_1,\dots ,s_k\).
We generate 10, 000 random matrices like \(\textbf{C}\) with \(k=4\) and the row sums \(c_{1,\cdot }=c_{2,\cdot }=c_{3,\cdot }=100\), and \(c_{4,\cdot }=700\), and then improve the membership of only one point, obtaining \(\textbf{C}'\) with \(c_{1,1}'=c_{1,1}+1\) and \(c_{2,1}'=c_{2,1}-1\). Figure 2 presents a box and whisker plot depicting the empirical distribution of the 10, 000 differences \(\textrm{I}(\textbf{C}')-\textrm{I}(\textbf{C})\) for each index \(\textrm{I}\). If any increment is negative, and this is the case for all the indices but A, CA, NA, and NCA, then we conclude that [MON] does not hold. Moreover, we note how unpredictably the indices respond to such a simple adjustment of the confusion matrix.
3.9 Discussion
Table 2 summarises which index enjoys each of the above properties. Overall, the proposed measure – normalised clustering accuracy (NCA) – fulfils all properties except [SYM] and [E0].
As we have already noted, losing [SYM] is the price we pay for enforcing the invariance to cluster sizes [SC] using the employed normalisation.
Moreover, a nontrivial index cannot have the expected value of 0 for random partitions ([E0]) if it is bounded from below by 0 ([B0]). As far as matrices following the hypergeometric model are concerned, Fig. 3 depicts how the expected values change as a function of k in the case of clusters of equal sizes. Overall, for all indices except R, R\('\), and MI, we observe a decreasing trend.
Fig. 3
Expected values (based on 1, 000 Monte Carlo samples) of indices as functions of k for \(n=100k^2\), reference partitions of equal cardinalities, and predicted labels assigned at random (the hypergeometric model). Note the logarithmic scale on the y-axis. AMI, AR, and AFM are adjusted for chance (have expectation 0) and hence were omitted. Recall that, under \(c_{1,\cdot }=\dots =c_{k,\cdot }\), NCFM\('\)=NFM\('\), NCR\('\)=NR\('\), NCMI=NMI, NCA=NA, and CA=A. For indices fulfilling [U0] (i.e., those in the right subfigure), we additionally observe that, for all k, their expected decrease towards 0
To show that most of the properties are actually mild and do not contradict one another, let us consider the index \(\textrm{Z}\) given by:
$$\begin{aligned} \textrm{Z}(\textbf{C})=\left\{ \begin{array}{ll} 1 &{} \text {if }\textbf{C}=\textbf{P}_\sigma \textbf{S}\text { for some }\sigma \in \mathfrak {S}_k\text { and }\textbf{S}=\textrm{diag}(s_1,\dots ,s_k), \\ 0 &{} \text {otherwise}. \end{array} \right. \end{aligned}$$
(39)
Z fulfils all the properties considered except [E0] (however, it is only weakly, not strictly, monotone).
Let us now gain some insight into how to interpret concrete index values.
Example 13
Let \(\mathfrak {U}^{k\times k}_l\) represent the class of block diagonal matrices of size \(k\times k\) with \(l\le k\) blocks, each consisting of identical values, e.g.:
which represents the identification of l subclusters. In particular, \(\mathfrak {U}^{k\times k}_k\) are identity matrices and \(\mathfrak {U}^{k\times k}_1\) consist of matrices that we denoted by \(\textbf{U}^{k\times k}_{s,\dots ,s}\) (24). For such matrices, MI, AMI, N(C)MI give different values for each l, R\('\) and R are rather hard to interpret, however:
A, BA, CA, FM\('\), and approximately FM yield l/k,
NA, NBA, NCA, N(C)R\('\), N(C)FM\('\), and approximately AR and AFM output \((l-1)/(k-1)\).
We have already said that clustering is not classification: clusters are defined up to a permutation of set IDs. The price we pay for normalisation is the loss of 1 “degree of freedom” in the latter formula. This is what makes the normalised measures somewhat less intuitive.
From this perspective, NA and NCA have the most appealing interpretation, because they can be equivalently rewritten as classification rates (averages of proportions of correctly classified points in each cluster) above the random cluster membership assignment and with the optimal matching of cluster IDs:
with the difference being that NA relates the current partition to the one where all true clusters are of the same size.
Fig. 4
Behaviour of selected indices when moving from all points in a single cluster through the perfect match to the uniform assignment, one point at a time (\(k=2\), clusters of equal sizes; hence, NA=NCA, NR\('\)=NCR\('\), etc.). Only N(C)A yields a predictable, linear response. Note that N(C)R\('\) and N(C)FM\('\) are very similar: their curves are almost overlapping
Let us study the behaviour of the normalised indices when we transition between the case of all points allocated to one predicted cluster (\(\textbf{O}^{k\times k}_{s_1,\dots ,s_k}\)), through all points correctly classified (identity matrix), to the uniform assignment (\(\textbf{U}^{k\times k}_{s_1,\dots ,s_k}\)), one point at a time. Figure 4 depicts the case of \(k=2\) and \(s_1=s_2=54\):
1.
We start with all the points allocated to cluster 1. All indices fulfil [O0]; therefore, the reported similarities are equal to 0.
2.
We move the 54 points from the second reference cluster to its own group. For all indices but NCA, the response is initially slow, and speeds up only at the end.
3.
All indices enjoy [B1]; therefore, we reach the similarity of 1.
4.
We move 13 (ca. 25%) points back to cluster 1.
5.
We spread the points from cluster 1 uniformly.
6.
We spread the remaining points from cluster 2 uniformly, arriving at similarity 0 because of [U0].
Figure 5 illustrates \(k=3\) and \(s_1=s_2=s_3=24\):
1.
We start with all the points allocated to a single cluster.
2.
We move all the points in the third true cluster to its own group.
3.
We do the same with points from the second group until we reach the perfect match. Notice that NCMI, NCR\('\), and NCFM\('\) initially decrease.
4.
We move 12 points from the true cluster 1 to cluster 2.
5.
We relocate 12 points from the true cluster 2 to cluster 1.
6.
We move 8 points from the true cluster 1 to cluster 3 so that they are spread uniformly.
7.
We do the same in cluster 2.
8.
And similarly in cluster 3.
Only A, NA, CA, and NCA change uniformly when improving the cluster memberships.
4 Experiments
Let us compare the relationships between the indices on benchmark data. We consider 65 datasets from the paper by Gagolewski (2022) that consist of up to 10, 000 points and whose labels do not include any noise points, namely, from the FCPS battery (Thrun & Ultsch, 2020): atom, chainlink, engytime, hepta, lsun, target, tetra, twodiamonds, wingnut; from Graves (Graves & Pedrycz, 2010): dense, fuzzyx, line, parabolic, ring, ring_outliers, zigzag; from SIPU (Fränti & Sieranoja, 2018): a1, a2, a3, aggregation, compound, d31, flame, jain, pathbased, r15, s1, s2, s3, s4, spiral, unbalance; from UCI (Dua & Graff, 2022): ecoli, glass, ionosphere, sonar, statlog, wdbc, wine, yeast; Other: iris, iris5, square; and from WUT: circles, cross, graph, isolation, labirynth, mk1, mk2, mk3, mk4, olympic, smile, stripes, trajectories, trapped_lovers, twosplashes, windows, x1, x2, x3, z1, z2, z3. Each dataset comes with at least one reference partition, which we related, using all the external cluster validity measures studied herein, to the outputs of 10 algorithms: classical agglomerative hierarchical clustering methods based on the Single, Average, Complete, Ward, Centroid, and Median linkages (Müllner, 2011) and algorithms implemented in scikit-learn for Python (Pedregosa, 2011): K-Means, expectation-maximisation (EM) for Gaussian mixtures (n_init=100), Birch (threshold=0.01, branching_factor=50), and Spectral (affinity=Laplacian, gamma=5). In each case, the ground truth labels determine the number of subsets k to be detected, which is given as input to the algorithms. The spectral algorithm failed to converge on one dataset (WUT/x3 with \(k=3\)), so we excluded this benchmark scenario from further analysis.
Overall, we obtained \(74\cdot 10=740\) readings of each of the similarity scores. We will not take MI into account because it does not meet [B1].
Let us first relate the indices to each other pairwisely (95 cases of perfect agreement between true and predicted labels were removed as this trivially corresponds to all scores equal to 1). There is a very high degree of correlation between specific pairs (Pearson’s \(r > 0.999\)): R and R\('\); FM and FM\('\); AFM, NFM\('\), AR, and NR\('\); NCFM\('\) and NCR\('\); AMI and NMI. Therefore, we will only be considering the last index from each group from now on.
The least correlated pairs of indices, as measured with the Spearman’s rank correlation coefficient, which takes into account any monotonic relationships (not necessarily linear) are: FM\('\) and NCMI \((\rho \approx 0.751\)); FM\('\) and NCA \((\rho \approx 0.766\)); BA and NCMI \((\rho \approx 0.766\)); FM\('\) and R\('\)\((\rho \approx 0.772\)); R\('\) and CA \((\rho \approx 0.777\)).
Some non-normalised indices correlate quite strongly with their normalised counterparts. The following index pairs have \(\rho >0.95\): R\('\) and NR\('\) (\(\rho \approx 0.970\)); R\('\) and NMI (\(\rho \approx 0.963\)); CA and NCA (\(\rho \approx 0.960\)); A and NA (\(\rho \approx 0.951\)). Nevertheless, we note that the true partitions in our benchmark set have diverse cardinalities (ranging from 2 to 50), and the lower bounds of the non-normalised indices are a function of k.
Focusing on the normalised indices, the scatter plot matrix in Fig. 6 shows how different indices relate to one another.
Let us note that 29 of the 74 true label vectors define clusters of equal sizes. Overall, for 39 label vectors, the Gini index of true cluster sizes is less than 0.05 (with the average Gini index being 0.189). Therefore, in our study, the correction for cluster sizes has no or very small effect in many cases. This is why NR\('\) and NCR\('\), NMI and NCMI, as well as NA and NCA correlate highly with one another.
We note a high degree of rank correlation between: NCR\('\) and NCA; NCR\('\) and NCMI; NR\('\) and NA; NR\('\) and NMI; NBA and NA. Thence, we can try to express one index from each pair by a monotone function of another with not too big an error.
Fig. 6
Scatter plot matrix for selected normalised indices (we noted that some index groups are very highly correlated: AFM, NFM\('\), AR, and NR\('\); NCFM\('\) and NCR\('\); AMI and NMI). In the lower right and top left corners, respectively, Pearson’s r and Spearman’s \(\rho \) correlation coefficients are reported. Circa 53% of the true partitions in our data sample consist of clusters of almost equal sizes; hence, the correction for [SC] (e.g., transforming NA to obtain NCA) has little impact in these cases
We may also be interested in determining how the different external validity measures allow us to compare the overall quality of the 10 algorithms. Table 3 gives the rankings based on the median scores. Dasgupta and Ng (2009) and Gagolewski (2022) noted that one dataset can have many possible equally valid clusterings. Thus, in what follows, in the case of datasets with more than one reference labelling, for each algorithm, the best score is taken into account.
Only two methods, Gaussian mixtures and spectral clustering, are consistently ranked as the best ones.
If we employ Kendall’s \(\tau \), a correlation measure based on counting concordant and discordant object pairs, to assess the similarity of the rankings, we discover that the rankings generated by NMI and NCMI are the least correlated with those by other indices. For instance, they both rank the single linkage method third, whereas most other indices consider it the worst. The most similar pair is NA and NBA. Also, NCR\('\) correlates highly with NR\('\), NBA, NA, and NCA.
Table 3
Rankings of the 10 clustering algorithms based on median values (in round brackets) of different external cluster validity measures across the 64 benchmark datasets
NR\('\)
NCR\('\)
NMI
NCMI
NBA
NA
NCA
GaussMix
1 (0.80)
1 (0.79)
1 (0.80)
1 (0.83)
1 (0.82)
2 (0.85)
1 (0.87)
Spectral
2 (0.75)
2 (0.75)
2 (0.79)
2 (0.80)
2 (0.78)
1 (0.85)
2 (0.85)
Ward
4 (0.54)
4 (0.60)
7 (0.63)
7 (0.67)
5 (0.51)
5 (0.63)
3 (0.78)
Birch
3 (0.54)
3 (0.64)
4 (0.64)
5 (0.71)
4 (0.53)
4 (0.64)
4 (0.77)
KMeans
6 (0.51)
5 (0.58)
6 (0.64)
4 (0.72)
3 (0.54)
3 (0.68)
5 (0.72)
Average
5 (0.51)
6 (0.55)
5 (0.64)
6 (0.68)
6 (0.44)
6 (0.59)
6 (0.63)
Median
10 (0.37)
9 (0.51)
8 (0.57)
10 (0.60)
8 (0.41)
9 (0.55)
7 (0.63)
Centroid
7 (0.47)
7 (0.54)
9 (0.56)
8 (0.65)
7 (0.42)
7 (0.58)
8 (0.63)
Complete
9 (0.40)
8 (0.51)
10 (0.53)
9 (0.63)
9 (0.40)
8 (0.56)
9 (0.57)
Single
8 (0.45)
10 (0.46)
3 (0.76)
3 (0.74)
10 (0.28)
10 (0.44)
10 (0.43)
Let us stress that the purpose of this experiment is not to discover good algorithms, but to assess the effect of index choice on the rankings
5 Partitions of Different Cardinalities
Many indices can be extended to the case of confusion matrices with the number of rows k being different from the number of columns \(k'\). For instance, in the definitions of R, FM, MI, and their derivatives, we can replace the summation \(\sum _{i=1}^k \sum _{j=1}^k\) with \(\sum _{i=1}^k \sum _{j=1}^{k'}\).
We can generalise the normalised clustering accuracy as:
under the assumption \(c_{i,j}=0\) for \(j>k'\). If \(k<k'\), some predicted clusters are not matched with true ones. Hence, they are not counted, which decreases the computed similarity. If \(k>k'\), we treat the missing \(k-k'\) predicted clusters as empty: the number of points matched is zero. Overall, this version of the index penalises a clustering algorithm that outputs a grouping of cardinality \(k'\) different from the desired k.
Yet, we might also be interested in the case where \(k'\ne k\) is what we actually ask all the benchmarked algorithms for. In such a case, we can consider:
We leave the exploration of the extensions of the indices to the case \(k\ne k'\) and their properties as a topic of further research, because it deserves a more exhaustive treatment, for which no room is left in the current contribution.
6 Conclusion
We reasoned that using symmetric partition similarity scores to compare predicted partitions with fixed reference ones might not be ideal. Thus, we proposed a new measure called normalised clustering accuracy, which uses optimal cluster matching and is corrected for cluster sizes. We showed that it enjoys a number of desirable properties, including scale invariance, boundedness, and monotonicity. As far as interpretability is concerned, e.g., NCA=0.5 means that above the perfectly uniform cluster membership assignment, on average, 50% of points in each cluster are correctly discovered.
As a topic for further research, we would like to use some of the proposed properties as a basis for characterising certain indices. We would also like to study other desirable features discussed in the literature (Hennig, 2015; Wagner & Wagner, 2006; Warrens & van der Hoef, 2022; Rezaei & Fränti, 2016; Luna-Romera et al., 2019; Gates et al., 2019; Arinik et al., 2021; Xiang, 2012). It will also be interesting to derive the formula for the expected value of \(\max _\sigma \sum _i c_{i,\sigma (i)}\) and \(\max _\sigma \sum _i c_{i,\sigma (i)}/c_{i,\cdot }\) in the hypergeometric and other random models.
Furthermore, we will extend our notes from Sect. 5 to the case where the predicted clustering can be finer- or coarser-grained than the reference one (partitions of different cardinalities), including whole cluster hierarchies. We will also formulate similar properties that are more tailored to comparing fuzzy/soft/probabilistic (e.g., overlapping) or other types of partitions (Horta & Campello, 2015; Campagner et al., 2023; Andrews et al., 2022; D’Ambrosio et al., 2021; Hüllermeier et al., 2012), also in the case where the reference clustering is not crisp (compare Flynt et al., 2019).
Moreover, we may want to study similar metrics adjusted to the problem of semi-supervised learning, i.e., where some cluster memberships are known to the algorithm a priori.
Acknowledgements
The author is indebted to the Editor and Reviewers for providing numerous insightful remarks that led to the improvement of earlier versions of this paper.
Declarations
Conflict of Interest
The author declares no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A.1 Proof that NCR\('\) and NCFM\('\) Fulfil [B1] and [B0]
As \(\sqrt{ \sum _{i=1}^k c_{i, \cdot }^2 \, \sum _{j=1}^k c_{\cdot , j}^2 } \le \frac{ \sum _{i=1}^k c_{i, \cdot }^2+ \sum _{j=1}^k c_{\cdot , j}^2 }{2}\), we have \(\textrm{NR}'(\textbf{C}) \le \textrm{NFM}'(\textbf{C})\le 1\). The latter inequality is due to the fact that \( \sum _{i=1}^k \sum _{j=1}^k c_{i,j}^2 \le \sqrt{ \sum _{i=1}^k c_{i, \cdot }^2 \, \sum _{j=1}^k c_{\cdot , j}^2 } \), which holds because \(\sum _{i=1}^k c_{i, \cdot }^2 = \sum _{i=1}^k \sum _{j=1}^k c_{i,j}^2 + \sum _{i=1}^k \sum _{u\ne v} c_{i,u}c_{i,v} \) and \(\sum _{j=1}^k c_{j, \cdot }^2 = \sum _{i=1}^k \sum _{j=1}^k c_{i,j}^2 + \sum _{j=1}^k \sum _{u\ne v} c_{u,j}c_{v,j} \). This also easily implies that \(\textrm{NR}'(\textbf{C}) = \textrm{NFM}'(\textbf{C})= 1\) if and only if \(\textbf{C}=\textbf{P}_\sigma \textbf{S}\) for some \(\sigma \in \mathfrak {S}_k\) and \(\textbf{S}=\textrm{diag}(s_1,\dots ,s_k)\). The same holds for NCR\('\) and NCFM\('\). Hence, they meet property [B1].
To show that NCFM\('\) and NCR\('\) are nonnegative, it suffices to prove that \( \frac{ \sum _{i=1}^k c_{i, \cdot }^2\, \sum _{j=1}^k c_{\cdot , j}^2 }{n^2} \le \sum _{i=1}^k \sum _{j=1}^k c_{i,j}^2 \) under \(c_{1,\cdot }=\dots =c_{k,\cdot }=1\), i.e., for matrices adjusted for cluster sizes. As \(n=k\), the inequality can be rewritten as \( \sum _{j=1}^k \left( \sum _{i=1}^k c_{i, j}\right) ^2 \le \sum _{j=1}^k \left( k \sum _{i=1}^k c_{i,j}^2\right) \) and that \(\left( \sum _{i=1}^k c_{i, j}\right) ^2\le k \sum _{i=1}^k c_{i,j}^2\) holds for all j is implied by the Cauchy-Schwarz inequality which states that \(\textbf{u}\circ \textbf{v}\le \Vert \textbf{u}\Vert \, \Vert \textbf{v}\Vert \) with \(\textbf{u}=(c_{1,j}, \dots , c_{k,j})\) and \(\textbf{v}=(1,\dots ,1)\). As NCFM\('\) and NCR\('\) fulfil [U0] and [O0] (compare Table 1), this implies that they also meet [B0].
A.2 Proof that \(\textrm{A}\) Is Bounded from Below by 1/k
We want to prove that for all \(\textbf{C}\in \mathfrak {C}^{k\times k}\), we have \(\textrm{A}(\textbf{C})=\max _{\sigma \in \mathfrak {S}_k} \sum _{j=1}^k \frac{c_{\sigma (j), j}}{n} \ge 1/k\).
Let \(\sigma _1\in \mathfrak {S}_k\) be the permutation that maximises \(\sum _{j=1}^k {c_{\sigma _1(j), j}}\). We have to show that \(k \sum _{j=1}^k {c_{\sigma _1(j), j}} \ge \sum _{i=1}^k \sum _{j=1}^k c_{i,j}=n\). Take any other \(k-1\) permutations \(\sigma _2,\dots ,\sigma _k\in \mathfrak {S}_k\) such that the set \(\{ (\sigma _i(j), j) \}_{i,j=1,\dots ,k}\) consists of all possible index pairs, \(\{1,\dots ,k\}\times \{1,\dots ,k\}\).
As \(\sigma _1\) is the maximal permutation, we have \(\sum _{j=1}^k {c_{\sigma _1(j), j}} \ge \sum _{j=1}^k {c_{\sigma _i(j), j}} \) for all i. Hence, \( k \sum _{j=1}^k {c_{\sigma _1(j), j}} \ge \sum _{i=1}^k \sum _{j=1}^k c_{\sigma _i(j),j} =\sum _{i=1}^k \sum _{j=1}^k c_{i,j}\), QED.
Note that this also implies that CA is bounded from below by 1/k and that NCA is bounded from below by 0. These are actually minima; as they are attained at \(\textbf{U}_{s_1,\dots ,s_k}^{k\times k}\); see Table 1.
One of the reviewers of this paper noted that “[B0] is as much a convention (rather than a desirable feature) as the maximum of 1 is. Allowing for negative values is not a reason against adjusted Rand for example. In fact, [B0] implies that an index cannot have the expected value 0 on random partitions, and the latter can be seen as desirable (later covered as [E0]).” We acknowledge that in certain applications, [E0] might indeed be more desirable than [B0].
We shall relax this condition below. Furthermore, the more complicated case where the number of reference and predicted clusters might differ will also be considered.
One of the reviewers disagreed with this statement: “I actually find it very desirable that the latter value tells me that this is worse than average.” We acknowledge that in certain applications, it might indeed be more welcome a behaviour.
Interestingly, Chacón (2021) suggested that if a measure is not normalised, the lower bound should be provided alongside the index value. This way, we can report, e.g., “A=0.53 (min=0.5 for \(k=2\))”, indicating that the similarity score is close to the worst-case scenario.
We note that this matrix is equal to the expected value of the confusion matrix in the hypergeometric model assuming given row sums \(s_1,\dots ,s_k>0\) and each column sum equal to n/k. However, in our setting, the sense of a uniform assignment of the predicted cluster memberships is purely algebraic; we do not generate the matrices at random. Similarly, Chacón (2021) notes that this corresponds to “the situation where the labels of the first clustering are perfectly independent of the labels in the second clustering”.
If we had an index that does not accept matrices with column sums of 0, we would introduce \(\textbf{O}=(o_{i,j})\) as having \(o_{i,i}=\varepsilon >0\) and \(o_{i, 1}=s_i-\varepsilon \) for all \(i\ge 2\) and some infinitesimal \(\varepsilon >0\). However, all our examples are well-defined for \(\textbf{O}\)s given by Eq. 25, which is, by all odds, a much simpler setting.
Note again that, for simplicity, we consider real-valued confusion matrices. The minimum in the case of crisp memberships was derived by Charon et al. (2006). It is equal to \((\lceil n/k\rceil )/n\) and it is approximately equal to 1/k for large n.