In this section, we outline the concept and the experimental design of our study. We first explain the three categories of over-optimization mechanisms that we illustrate. We then detail our concrete implementation, i.e., the cluster algorithms, datasets, evaluation measure and optimization method.
3.1 Three categories of over-optimization
Imagine a researcher who wishes to present his/her cluster algorithm in a favorable light. We model the work process of this researcher as an “optimization task”: the characteristics of the study in which the new algorithm is compared to existing ones are optimized such that the researcher’s algorithm scores well, in particular better than the best performing competing algorithm. This optimization can refer to (1.) finding datasets or data characteristics for which the new algorithm works particularly well, (2.) finding optimal parameters of the algorithm (and vice versa, neglecting the search for optimal parameters for the competitors) or (3.) choosing specific competing algorithms.
Optimizing datasets or data characteristics. A new cluster algorithm might perform well for specific types of datasets, but not for other types. Researchers might decide to report only the best-performing types of datasets. Additionally, for synthetic datasets, there is potential for over-optimism when varying specific characteristics (e.g., the amount of noise, the sample size, or the number of dimensions) and reporting only the optimal settings. Moreover, simulated datasets depend on the random seed, such that the performance of the cluster algorithm might in turn vary over different random seeds. Researchers might actively look for a “good” random seed or simply stumble across a particularly “good” random seed by chance, neglecting to try other random seeds to check for robustness.
Optimizing the algorithm’s parameters or characteristics. Hyperparameters of the cluster algorithm, or characteristics of the algorithm designed during the development phase, could be varied by researchers to look for the best result. Hyperparameter optimization (HPO) is per se a legitimate procedure in performance evaluation. However, there is less awareness of the proper evaluation of cluster algorithms combined with HPO, compared to the more extensive methodological literature on the correct evaluation of supervised classifiers with HPO (Boulesteix et al. 2008; Bischl et al. 2021). In cluster analysis, over-optimism in relation to HPO may result from (1.) optimizing hyperparameters based on the “true” cluster labels known to the researchers, and (2.) not splitting the data into training and test sets. Both aspects will be discussed in more detail in Sects. 4 and 5. Moreover, over-optimism might also result when researchers neglect to set optimal parameters for the competing algorithms, e.g., when choosing suboptimal hyperparameter defaults for the competitors while fine-tuning their own algorithm.
Optimizing the choice of competing algorithms. Finally, researchers might pick specific competing clustering methods that let their own algorithm appear in a better light. They could neglect to look for the best state-of-the-art competitor, instead opting for less optimal comparison algorithms. Even if the researchers are aware of state-of-the-art competitors, they might not include them because the code is not openly available, or because it is implemented in a programming language with which they are not familiar. Researchers could also think of different groups of competing cluster algorithms, and then pick the group that is most favorable for comparison with their own algorithm. A new density-based cluster algorithm could, for example, be compared either with a group of other density-based algorithms, or with a group of some well-known, not necessarily density-based cluster algorithms. While both choices could in principle be sensible, it is over-optimistic if researchers either deliberately exclude a class of competitors a priori because they expect their novel algorithm to perform worse than this class, or if they choose the competitor group a posteriori after having seen the results (Jelizarow et al. 2010).
Apart from these three categories of optimization, there are some further optimization possibilities (e.g., optimizing the evaluation measure) that we do not analyze here in detail, but briefly discuss in Sect. 5.
We assume that usually, researchers do not consciously perform the three classes of optimization tasks in a malicious and systematic manner. Nevertheless, in the course of a longer research process during which researchers try different datasets, algorithm parameters/configurations and competing algorithms, they might optimize the settings in an unsystematic and (probably) unintentional manner. Even if researchers start their analysis with the best intentions, they might post-hoc rationalize their (over-optimistic) choices as perfectly reasonable decisions, given that “[h]umans are remarkably good at self-deception” and scientists often “fool themselves” (Nuzzo 2015).
One might argue that the optimizations outlined above are not actually over-optimizations and that it is perfectly fine to look for scenarios in which a novel algorithm performs well. We would agree that it is not a priori wrong to search for and report such scenarios, as a new cluster algorithm can never be expected to outperform every other cluster algorithm in every situation. However, it should also be transparently reported how the presented “successful” scenarios were obtained, and how the algorithm performs in other settings. Over-optimism ultimately arises when performance results are selectively reported. We will illustrate this with our results in Sect. 4.
3.2 Experimental setup
In accordance with the authors, we used the already published algorithm Rock (Beer et al. 2019) as a novel and promising algorithm. Rock is an iterative approach similar to Mean Shift (Fukunaga and Hostetler 1975), but based on the k nearest neighbors (kNN) instead of the bandwidth. In each step, points “roam” to the mean of their respective k nearest neighbors. Points with a similar final position are assigned to a common cluster. The algorithm involves the hyperparameter \(t_{max}\), which gives the maximum number of iterations. As the maximum meaningful value for k is fixed (\(k>\frac{n}{2}\) would lead to an assignment of all points to the same cluster) and the increase of k in every step is linear, \(t_{max}\) also determines the number k of nearest neighbors regarded in each iteration. The larger \(t_{max}\) is chosen, the closer the values of k are in consecutive steps. Lower values of \(t_{max}\) thus lead to larger gaps between consecutive values of k, which may cause volatile merges of different clusters. On the other hand, higher values of \(t_{max}\) lead to more iterations, which increases the runtime.
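To make the iteration scheme concrete, the following is a minimal sketch of the “roaming” idea as described above; it is not the authors’ reference implementation, and the exact schedule for increasing k as well as the merging tolerance are our own simplifying assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def rock_sketch(X, t_max=15, tol=1e-2):
    """Minimal sketch of the 'roaming' idea described above (not the reference implementation).

    In iteration t, every point moves to the mean of its k nearest neighbors, where k grows
    linearly with t up to n/2; points with (almost) identical final positions form one cluster.
    """
    n = X.shape[0]
    Z = X.astype(float).copy()
    for t in range(1, t_max + 1):
        k = max(1, int(t / t_max * n / 2))   # assumed linear increase of k, capped at n/2
        neigh = NearestNeighbors(n_neighbors=k).fit(Z)
        _, idx = neigh.kneighbors(Z)
        Z = Z[idx].mean(axis=1)              # each point "roams" to the mean of its k nearest neighbors
    labels = np.full(n, -1, dtype=int)
    next_label = 0
    for i in range(n):                       # group points with similar final positions
        if labels[i] == -1:
            labels[np.linalg.norm(Z - Z[i], axis=1) < tol] = next_label
            next_label += 1
    return labels
```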
As is typical for short papers, only a limited number of experiments is presented in Beer et al. (2019), illustrating that the underlying idea is promising. The results for Rock looked good compared to k-means (Lloyd 1982), DBSCAN (Ester et al. 1996) and Mean Shift, which are typical competitors in the field and representatives of algorithms that find different types of clusters. As examples of competing algorithms, we thus chose k-means, DBSCAN, Mean Shift and additionally Spectral Clustering (Ng et al. 2001).
As the clustering performance measure, we use the Adjusted Mutual Information (AMI, Vinh et al. 2010), a version of the Mutual Information (MI) adjusted for chance agreement of random partitions. For each dataset and cluster algorithm, the known “true” clustering (as given either by the simulation design for the synthetic datasets or by additional label information for the real datasets) was compared via the AMI with the clustering found by the algorithm. The higher the AMI, the more similar the two clusterings are. The AMI attains its maximum value of 1 if the two clusterings are identical, and equals 0 if the MI between the two clusterings is equal to the MI value expected for two random partitions. We give the detailed mathematical definition of the AMI in Appendix A.
While we only use the AMI in our illustration for the sake of conciseness, a similar analysis could be performed for alternative indices which measure the agreement of the calculated clusterings with the “ground truth”, or even for internal validation indices which evaluate clusterings based on internal properties of the data alone and do not require the “ground truth” (see also the discussion in Sect. 5.2).
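For illustration, the AMI between a “true” and a computed partition can be evaluated, e.g., with the scikit-learn implementation; the labels below are made up for this minimal example.

```python
from sklearn.metrics import adjusted_mutual_info_score

# Hypothetical example: "true" labels vs. labels found by a cluster algorithm.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]   # permuting the label names does not affect the AMI

# 1.0 for identical partitions; approximately 0 for chance-level agreement.
print(adjusted_mutual_info_score(y_true, y_pred))
```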
The choice of exemplary datasets is linked to the three different optimization tasks outlined in Sect. 3.1. We thus give the datasets for each task in turn and explain how the optimization was performed. Note that we performed the three optimization tasks sequentially, building on the results of each previous task. Of course, in reality, a researcher will likely not perform the optimizations in such a perfectly sequential manner, and might jump between different tasks of optimization or try to optimize different aspects simultaneously. Again, our sequential procedure merely serves illustrative purposes.
For some specific details of the implementation, we refer to Appendix A.
Optimizing datasets and data characteristics. For this part of the analysis, we chose three different, commonly used synthetic datasets from scikit-learn (Pedregosa et al. 2011), see Fig. 2: Two Moons, Blobs (for details on this dataset, see Appendix A), and Rings.
First, we performed optimization by varying the following data characteristics: a) for Two Moons, the sample size and the jitter values (where “jitter” denotes small random perturbations to the original data points in the clusters), b) for Blobs, the sample size, the number of dimensions and the number of generated clusters (“blobs”), and c) for Rings, the sample size and the jitter values. The goal of the optimization was to find the parameter configuration (e.g., for Two Moons, the configuration (n, j) of sample size and jitter value) that yields the largest performance difference between Rock and the best of the competitors – which is not necessarily the parameter configuration that yields the best absolute performance of Rock.
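These three dataset types can be generated with the corresponding scikit-learn functions; the parameter values below are hypothetical placeholders rather than the grids we actually searched.

```python
from sklearn.datasets import make_blobs, make_circles, make_moons

seed = 0  # one of the random seeds used per setting
X_moons, y_moons = make_moons(n_samples=1000, noise=0.1, random_state=seed)     # "jitter" = noise
X_blobs, y_blobs = make_blobs(n_samples=1000, n_features=2, centers=3, random_state=seed)
X_rings, y_rings = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=seed)
```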
That is, for each of the three types of synthetic datasets in turn, we performed the following formal optimization task:
$$\begin{aligned} \text {argmax}_{D \in {\mathcal {D}}} \left\{ \frac{1}{10} \sum _{i = 1}^{10} \Big ( AMI\left( Rock(D^i), y_{D^i}\right) - \text {max}_{C \in {\mathcal {C}}} AMI\left( C(D^i), y_{D^i}\right) \Big ) \right\} \end{aligned}$$
(1)
where \(D \in {\mathcal {D}}\) denotes the different variants of the dataset. For example, for the Two Moons data, each dataset \(D\) is a version of Two Moons with a specific jitter value and sample size. Each \(D\) has a cluster label ground truth \(y_D\). For each \(D \in {\mathcal {D}}\), ten different versions of \(D\), namely \(D^i, i = 1,\ldots ,10\), resulting from ten different random seeds were generated. Put differently, we performed ten simulation iterations per setting, i.e., we sampled ten datasets from each data distribution with a specific data parameter setting. The AMI difference is then averaged over these ten versions, which is supposed to reduce the influence of the random seed. Only at a later point in the analysis did we look at the effect of picking specific random seeds (see below). \(Rock(D^i)\) denotes the application of Rock to the data \(D^i\), returning a partition of the objects. Analogously, the competing algorithms \(C \in {\mathcal {C}}\) return a partition of \(D^i\), with \({\mathcal {C}} = \{k\text {-means, DBSCAN, Mean Shift, Spectral Clustering}\}\).
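The objective in (1), evaluated for a single data parameter setting, could be computed along the following lines. This is a sketch: `generate_dataset`, `fit_rock` and the entries of `competitors` are hypothetical helper functions that return, respectively, a simulated dataset with its ground truth and the labels found by an algorithm.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score

def ami_gap(generate_dataset, fit_rock, competitors, n_seeds=10):
    """Average AMI difference of Eq. (1) for one data parameter setting D (sketch).

    generate_dataset(seed) returns (X, y_true) for one simulated dataset D^i;
    fit_rock(X) and each fit(X) in `competitors` return the labels found by an algorithm.
    """
    gaps = []
    for seed in range(n_seeds):
        X, y_true = generate_dataset(seed)
        ami_rock = adjusted_mutual_info_score(y_true, fit_rock(X))
        ami_best_competitor = max(adjusted_mutual_info_score(y_true, fit(X)) for fit in competitors)
        gaps.append(ami_rock - ami_best_competitor)
    return float(np.mean(gaps))  # the quantity maximized over the data settings D in Eq. (1)
```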
For each of the three types of datasets in turn, we performed the optimization task (1) by using the Tree-structured Parzen Estimator (TPE, Bergstra et al. 2011), as implemented in the Optuna framework (Akiba et al. 2019) in Python. The TPE is a Bayesian optimization (BO) method. BO approaches sequentially propose new parameter configurations based on a library of previous evaluations of the objective function (for more details on BO methods and the TPE, see Appendix A). The TPE is often used for hyperparameter optimization of machine learning models, but in our case, we use it to optimize the data parameters. The TPE optimization can be considered a very simplified model of the researcher’s optimization procedure. Of course, a researcher’s behavior does not exactly correspond to the mathematical procedure of the TPE. However, if researchers perform intentional (over-)optimization, they might indeed use an optimization method such as the TPE to find the best data settings. The Bayesian optimization mimics the researcher’s (unintentional) over-optimization in the following sense: as mentioned above, a researcher developing a new cluster algorithm might sequentially look for data settings in which the new algorithm performs well, taking into account performance information from previously tried data parameters. This is why we chose the TPE over a simple grid search or random search, as the latter do not use previously obtained performance information. To make the TPE process more “realistic”, we supplied a grid of limited discrete values to the TPE, given that a researcher presumably would not try arbitrary real numbers. We performed this experiment with only 100 optimization steps for each of the three types of datasets, in order to fairly represent a researcher trying different data parameters by hand.
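A minimal Optuna sketch of this data-parameter optimization for Two Moons could look as follows; the discrete grids shown are hypothetical placeholders, and `ami_gap`, `fit_rock` and `competitors` are the hypothetical helpers from the sketch above.

```python
import optuna
from sklearn.datasets import make_moons

def objective(trial):
    # Discrete grids of data parameters (hypothetical values for illustration).
    n_samples = trial.suggest_categorical("n_samples", [500, 1000, 1500, 2000])
    jitter = trial.suggest_categorical("jitter", [0.0, 0.05, 0.1, 0.15, 0.2])
    gen = lambda seed: make_moons(n_samples=n_samples, noise=jitter, random_state=seed)
    return ami_gap(gen, fit_rock, competitors)   # averaged AMI difference of Eq. (1)

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=100)          # 100 optimization steps, as described above
print(study.best_params)
```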
After determining the optimal values for the data parameters (which we will later report in Table 1 in Sect. 4.1), we analyzed the performance of Rock for non-optimal parameter values. That is, for each dataset and each single data parameter in turn, the parameter was varied over a list of values, while the other data parameters were kept fixed at their optimal values. For example, for the Two Moons dataset, we tried different jitter values and plotted the corresponding performance, as measured by the mean AMI over ten random seeds, against the jitter, keeping the sample size at the optimal value determined by the TPE. These analyses show the effects of selectively reporting only the best data parameters versus the performance of the algorithm over a broader range of each data parameter.
In the experiments given so far, we always considered the AMI averaged over ten random seeds. In the final step of the analysis for this section, we specifically study the influence of individual random seeds. We take the Two Moons dataset as an example, with a data parameter setting which is not optimal for Rock, but for which DBSCAN performs very well. We generate 100 datasets with these characteristics by setting 100 different random seeds, to check whether there exist particular seeds for which Rock does perform well, leading to over-optimization potential.
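Sketched in code, this seed analysis amounts to the following loop; the concrete data parameter setting shown is a hypothetical placeholder, and `fit_rock` again stands for a hypothetical wrapper returning Rock’s labels.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_mutual_info_score

# Generate Two Moons under 100 different random seeds with a (hypothetical) non-optimal
# parameter setting and record Rock's AMI for each seed.
amis = []
for seed in range(100):
    X, y = make_moons(n_samples=1000, noise=0.25, random_state=seed)  # hypothetical setting
    amis.append(adjusted_mutual_info_score(y, fit_rock(X)))
lucky_seed = int(np.argmax(amis))  # a seed for which Rock happens to perform well
```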
For all experiments described so far, we applied reasonable parameter choices (defaults or heuristics) for the cluster algorithms. For Rock, we chose \(t_{max}=15\), as done for all experiments in the original paper (Beer et al. 2019); for the parameter choices of the competing algorithms, see Appendix A.
Optimizing the algorithm’s parameters or characteristics. For this example, we varied Rock’s hyperparameter \(t_{max}\) (maximum number of iterations). As \(t_{max}\) is discrete with a reasonable range of \(\{1, \ldots , 30\}\), a researcher could easily try every value by hand. Thus, we did not perform the optimization with the TPE, but with a full grid search, i.e., we calculated the AMI performance of Rock for each value of \(t_{max}\) and for each dataset. For this illustration, we considered the absolute performance of Rock, given that researchers would also strive to maximize the absolute performance of their novel algorithm.
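Such a grid search is straightforward to express in code; the sketch below is for a single dataset `(X, y)`, and `run_rock(X, t_max)` is a hypothetical wrapper that runs Rock with the given \(t_{max}\) and returns the found labels.

```python
from sklearn.metrics import adjusted_mutual_info_score

# Full grid search over Rock's hyperparameter t_max in {1, ..., 30} for one dataset (X, y).
ami_per_tmax = {t_max: adjusted_mutual_info_score(y, run_rock(X, t_max)) for t_max in range(1, 31)}
best_tmax = max(ami_per_tmax, key=ami_per_tmax.get)
```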
As exemplary datasets, we again considered Two Moons, Blobs and Rings, and additionally four real datasets frequently used for performance evaluation: Digits, Wine, Iris and Breast Cancer, as provided by scikit-learn (see also the UCI Machine Learning Repository, Dua and Graff 2017). The data parameter settings for the three synthetic datasets (sample size, amount of jitter, etc.) corresponded to the optimal settings from the TPE optimization of (1). We used a single random seed to generate the illustrative synthetic datasets.
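The four real datasets ship with scikit-learn; their class labels serve as the “true” clusterings in the AMI computation.

```python
from sklearn.datasets import load_breast_cancer, load_digits, load_iris, load_wine

# Each entry maps a dataset name to (X, y), where y is the label information used as ground truth.
real_datasets = {
    "Digits": load_digits(return_X_y=True),
    "Wine": load_wine(return_X_y=True),
    "Iris": load_iris(return_X_y=True),
    "Breast Cancer": load_breast_cancer(return_X_y=True),
}
```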
In the next step, using the Two Moons dataset as an example, we compared the AMI performances of Rock and DBSCAN over ten random seeds, first without and then with hyperparameter optimization for Rock and DBSCAN. We used the TPE for the HPO of DBSCAN. Here, the TPE was not intended to model a researcher’s behavior, but was used as a classical HPO method. The comparison illustrates the effect of neglecting parameter optimization for competing algorithms.
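For DBSCAN, such a classical HPO step could be sketched as follows; the search ranges and the number of trials are hypothetical choices. Note that the objective uses the known labels `y`, which is exactly the kind of HPO discussed further in Sects. 4 and 5.

```python
import optuna
from sklearn.cluster import DBSCAN
from sklearn.metrics import adjusted_mutual_info_score

# Sketch of classical HPO for DBSCAN on one Two Moons dataset (X, y).
def dbscan_objective(trial):
    eps = trial.suggest_float("eps", 0.01, 1.0)            # hypothetical search range
    min_samples = trial.suggest_int("min_samples", 2, 50)  # hypothetical search range
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    return adjusted_mutual_info_score(y, labels)           # optimized w.r.t. the known "true" labels

study = optuna.create_study(direction="maximize", sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(dbscan_objective, n_trials=100)             # number of trials is a hypothetical choice
```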
Optimizing the choice of competing algorithms. We did not perform new experiments here. Rather, we looked at the results from the two previous optimization tasks to derive the potential for optimization of the choice of competing cluster algorithms.