On biological validity indices for soft clustering algorithms for gene expression data
Introduction
Unsupervised clustering methods have been widely applied to the analysis of gene expression data to identify biologically relevant groups of genes. Algorithms such as K-means, hierarchical clustering and self-organizing maps (SOM) are popular hard clustering methods, which assign each gene to exactly one cluster. However, this restriction may not be appropriate for analyzing gene expression profiles because a single gene might be involved in multiple functional categories. In contrast, a soft clustering algorithm, such as fuzzy c-means, assigns each gene to multiple clusters according to its degrees of membership and thus gives more information on gene multi-functionality (Dembele and Kastner, 2003).
To assess the quality and reliability of the clusters produced by a clustering algorithm, a variety of cluster validity indices have been proposed, such as the Dunn index (Dunn, 1973), the adjusted Rand index (Hubert and Arabie, 1985), the silhouette width (Rousseeuw, 1987), the figure of merit (FOM) (Yeung et al., 2001) and Hennig’s stability index (Hennig, 2007) (all primarily intended for hard clusters), as well as the Xie–Beni index (Xie and Beni, 1991) and Qiu and Joe’s separation index (Qiu and Joe, 2006) (primarily for fuzzy clusters). Several of these are among the most popular statistical indices for gene expression data analysis. See Handl et al. (2005) for an overview of cluster validation measures for post-genomic data. On the other hand, recent studies have suggested that incorporating biological information, such as functional and/or curated annotations, into validation methods might be useful for supporting biological and biomedical discoveries. For example, Bolshakova et al. (2005) presented a knowledge-driven approach for assessing cluster validity based on similarities extracted from the gene ontology (GO). Park et al. (2005) used a genetic algorithm for fuzzy clustering and prior knowledge of experimental data for their evaluation. Datta and Datta (2006) made use of biological information along with gene expression data and proposed two indices, the biological homogeneity index (BHI) and the biological stability index (BSI), for the biological evaluation of clustering algorithms. In summary, these indices aim to identify clustering results that have good statistical properties, such as compactness, separation, connectedness and stability, and that are also more biologically relevant.
For soft clustering algorithms, a common way to biologically evaluate the resulting class memberships is to select the class label with the highest membership value so that the existing indices can be applied. However, bio-indices that are designed specifically for hard clustering methods may not be sufficient when applied to the soft clustering of genes since some genes may belong to several statistical clusters. A biological evaluation index that considers the class membership is necessary for the correct evaluation of soft clustering algorithms. In the present study, we generalize the BHI and the BSI to quantify the abilities of soft clustering algorithms to produce biologically meaningful clusters using a reference set of functional classes. The indices investigated are the soft biological homogeneity index (SBHI) and the soft biological stability index (SBSI). To the best of our knowledge, no biological validation indices have been proposed thus far for soft clustering algorithms.
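The hardening step described above, in which each gene is given the label of its highest-membership cluster so that hard-cluster indices can be applied, can be sketched as follows. This is a minimal Python illustration; the function name `harden` and the toy membership values are assumptions made for this example and are not taken from the study.

```python
def harden(memberships):
    """Map each row of a fuzzy partition matrix to its highest-membership cluster.

    `memberships` is a list of rows, one per gene; row i holds the degrees
    of membership of gene i in each cluster.
    """
    labels = []
    for row in memberships:
        # Index of the largest membership; ties go to the lowest cluster index.
        labels.append(max(range(len(row)), key=row.__getitem__))
    return labels


# Toy fuzzy partition for three genes and two clusters (made-up values).
U = [
    [0.90, 0.10],  # clearly in cluster 0
    [0.55, 0.45],  # almost evenly split between the two clusters
    [0.20, 0.80],  # clearly in cluster 1
]
labels = harden(U)
print(labels)
```

In this toy case the second gene is assigned to cluster 0 although almost half of its membership lies in cluster 1; that partial membership is exactly the information a hard index cannot see.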
We evaluated the performance of several existing soft clustering algorithms in R (R Development Core Team, 2006), including fuzzy c-means clustering (Bezdek, 1981), fuzzy c-shell clustering (Dave, 1996), model-based clustering (Fraley and Raftery, 2002) and consensus clustering (Monti et al., 2003), on two simulated and three gene expression data sets and identified the optimal algorithm for each number of clusters. A biological reference set for the annotated genes of the relevant species was obtained from the gene ontology database. The proposed indices are helpful for selecting the optimal algorithm from a class of soft clustering algorithms for a given data set. We provide R code for computing these indices; statistical validation measures for fuzzy clusters are already available through the function fclustIndex in the package e1071 (Dimitriadou et al., 2006).
This paper is structured as follows. Section 2 introduces some soft clustering methods to be evaluated, which are available in R. The classical biological validation indices for hard clusters, the BHI and the BSI, are given in Section 3. Section 4 describes the generalized indices, the SBHI and the SBSI, for soft clustering algorithms and discusses their significance. Section 5 provides examples using two simulated and three gene expression data sets. We then conclude the paper in Section 6.
Soft clustering methods
In this section, we give a brief description of several existing soft clustering methods and their availability in R. These algorithms will be evaluated by the proposed indices. Let $X = \{x_1, \ldots, x_n\}$ be the data set to be analyzed, consisting of $n$ data points in $p$-dimensional space, and let $V = \{v_1, \ldots, v_c\}$ be a set of $c$ cluster centers. Let $U = [u_{ik}]$ be an $n \times c$ fuzzy partition matrix where $u_{ik}$ is the membership degree of a point $x_i$ in cluster $k$. Let $m > 1$ be a fuzziness index (often $m = 2$).
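To make the roles of the fuzzy partition matrix and the fuzziness index concrete, the sketch below computes memberships for fixed cluster centers using the standard fuzzy c-means update with Euclidean distance, in which the membership of a point in a cluster is the reciprocal of the sum over all clusters of (distance to this center / distance to that center) raised to the power 2/(m - 1). This is an illustrative Python sketch only; the function name `fcm_memberships` and the toy points are assumptions, and the R implementations evaluated in the paper may differ in detail.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Return the fuzzy partition matrix (one row per point) for fixed centers."""
    if m <= 1:
        raise ValueError("fuzziness index m must be greater than 1")
    exponent = 2.0 / (m - 1.0)
    partition = []
    for x in points:
        dists = [math.dist(x, v) for v in centers]
        zero = [k for k, d in enumerate(dists) if d == 0.0]
        if zero:
            # The point coincides with one or more centers:
            # share the full membership equally among those centers.
            row = [1.0 / len(zero) if k in zero else 0.0
                   for k in range(len(centers))]
        else:
            row = [1.0 / sum((d_k / d_j) ** exponent for d_j in dists)
                   for d_k in dists]
        partition.append(row)
    return partition


# Toy data: three points on a line and two centers (made-up values).
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
U = fcm_memberships(points, centers, m=2.0)
for row in U:
    print([round(u, 3) for u in row])
```

Each row sums to one, and the middle point, which lies at distance 1 from the first center and 3 from the second, receives a larger membership in the first cluster. Larger values of the fuzziness index flatten the memberships toward equal shares.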
Biological evaluation indices for hard clustering methods
In this section, we describe the BHI and the BSI, both of which were originally presented in Datta and Datta (2006). Let $B = \{B_1, \ldots, B_F\}$ be a set of $F$ functional classes that are not necessarily disjoint, and let $B(i)$ be the functional class containing gene $i$. Note that a gene may be involved in multiple functional categories.
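As an illustration of how a homogeneity index of this kind can be computed for hard clusters, the sketch below averages, over clusters, the fraction of annotated gene pairs within a cluster that share at least one functional class. The excerpt does not reproduce the defining equation, so this form is an assumption based on the commonly stated definition of the BHI; the function name `bhi`, the treatment of clusters with fewer than two annotated genes (they contribute zero) and the toy annotations are likewise assumptions.

```python
from itertools import combinations


def bhi(labels, annotations):
    """Biological homogeneity of a hard clustering (illustrative form).

    labels[g]      -> cluster label of gene g
    annotations[g] -> set of functional classes of gene g (missing or empty
                      for unannotated genes)
    """
    clusters = {}
    for gene, k in labels.items():
        clusters.setdefault(k, []).append(gene)

    total = 0.0
    for genes in clusters.values():
        annotated = [g for g in genes if annotations.get(g)]
        n_k = len(annotated)
        if n_k < 2:
            continue  # assumption: no pairs to compare, contributes 0
        # Count unordered pairs sharing at least one functional class.
        matches = sum(
            1 for a, b in combinations(annotated, 2)
            if annotations[a] & annotations[b]
        )
        total += matches / (n_k * (n_k - 1) / 2)
    return total / len(clusters)


# Toy example: five genes, two clusters, made-up functional classes.
labels = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
annotations = {
    "g1": {"A"},
    "g2": {"A", "B"},  # a gene may belong to several functional classes
    "g3": {"C"},
    "g4": {"B"},
    "g5": {"B"},
}
score = bhi(labels, annotations)
print(round(score, 4))
```

Here one of the three pairs in cluster 0 shares a class and the single pair in cluster 1 does, so the score is the average of 1/3 and 1. Values closer to one indicate clusters that are more homogeneous with respect to the reference classes.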
Biological evaluation indices for soft clustering methods
Based on the BHI and the BSI, we generalize Eqs. (7) and (8) to evaluate soft clustering algorithms, using the degrees of membership as weights in the formulation. For hard clusters, the membership $u_{ik} = 1$ if gene $i$ belongs to cluster $k$ and 0 otherwise. For simplicity, is omitted from in the following.
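The generalized formulas themselves are not shown in this excerpt. The sketch below is therefore only one plausible membership-weighted homogeneity score, written to illustrate the idea of using degrees of membership as weights; it should not be read as the SBHI defined in the paper. The weighting scheme (the product of the two genes' memberships in a cluster), the function name `soft_homogeneity` and the toy data are all assumptions.

```python
def soft_homogeneity(U, annotations):
    """Membership-weighted homogeneity (hypothetical illustration).

    U[i][k]        -> degree of membership of gene i in cluster k
    annotations[i] -> set of functional classes of gene i (empty if unannotated)
    """
    n = len(U)
    n_clusters = len(U[0])
    total = 0.0
    for k in range(n_clusters):
        matched_weight = 0.0
        all_weight = 0.0
        for i in range(n):
            if not annotations[i]:
                continue
            for j in range(i + 1, n):
                if not annotations[j]:
                    continue
                # Assumed pair weight: product of the two memberships in cluster k.
                w = U[i][k] * U[j][k]
                all_weight += w
                if annotations[i] & annotations[j]:
                    matched_weight += w
        if all_weight > 0:
            total += matched_weight / all_weight
    return total / n_clusters


annotations = [{"A"}, {"A", "B"}, {"C"}, {"B"}, {"B"}]

# Hard memberships (0 or 1): the score reduces to a pair-counting index.
U_hard = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
hard_score = soft_homogeneity(U_hard, annotations)

# Soft memberships (made-up values): every pair contributes to every cluster.
U_soft = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.1, 0.9], [0.2, 0.8]]
soft_score = soft_homogeneity(U_soft, annotations)
print(round(hard_score, 4), round(soft_score, 4))
```

With 0/1 memberships the weights select exactly the within-cluster pairs, so the score coincides with the average fraction of matching annotated pairs per cluster. This reduction to the hard case is the property any membership-weighted generalization would be expected to have.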
Examples
Several soft clustering algorithms freely available in R were evaluated for two simulated and three biological data sets for varying numbers of clusters. The algorithms evaluated include fuzzy c-means (cmeans and fanny), fuzzy c-shell clustering (cshell), model-based clustering (Mclust), consensus clustering (cl_bag) and random clustering (random). The one-minus Pearson correlations (C) were used as a dissimilarity measure for cmeans and fanny. The standard Euclidean distance (E) between
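The one-minus Pearson correlation dissimilarity mentioned above can be sketched in a few lines. This Python illustration is not the code used in the study; the function name `pearson_dissimilarity` and the toy expression profiles are assumptions.

```python
import math


def pearson_dissimilarity(x, y):
    """Return 1 - r, where r is the Pearson correlation of profiles x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return 1.0 - s_xy / math.sqrt(s_xx * s_yy)


# Toy expression profiles over four conditions (made-up values).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # same shape as gene_a, larger magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]   # opposite trend to gene_a

d_ab = pearson_dissimilarity(gene_a, gene_b)
d_ac = pearson_dissimilarity(gene_a, gene_c)
e_ab = math.dist(gene_a, gene_b)
print(d_ab, d_ac, round(e_ab, 3))
```

The dissimilarity ranges from 0 (perfectly correlated profiles) to 2 (perfectly anti-correlated). Unlike the Euclidean distance, it ignores differences in magnitude: `gene_a` and `gene_b` have zero correlation-based dissimilarity but a nonzero Euclidean distance.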
Conclusion
Existing biological knowledge, such as the GO database, can assist in the cluster validation process. In this study, we generalized the biological homogeneity index and the biological stability index by considering the resulting class memberships to formulate a soft biological homogeneity index and a soft biological stability index for evaluating soft clustering algorithms on gene expression data. These fuzzified indices provide better and more precise biological performance measures than the
Acknowledgements
The author thanks the associate editor and the referees for their valuable comments and suggestions. This research was supported by grants from the National Science Council of Taiwan, ROC (NSC 97-2118-M-032-001).
References (28)
- et al. Neural crest and mesoderm lineage-dependent gene expression in orofacial development. Differentiation (2007).
- et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell (1998).
- Cluster-wise assessment of cluster stability. Computational Statistics & Data Analysis (2007).
- et al. Separation index and partial membership for clustering. Computational Statistics & Data Analysis (2006).
- Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics (1987).
- et al. Transcriptomic changes in human breast cancer progression as determined by serial analysis of gene expression. Breast Cancer Research (2004).
- Pattern Recognition with Fuzzy Objective Function Algorithms (1981).
- et al. A knowledge-driven approach to cluster validity assessment. Bioinformatics (2005).
- et al. clValid: an R package for cluster validation. Journal of Statistical Software (2008).
- et al. Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics (2006).
- Fuzzy shell-clustering and applications to circle detection in digital images. International Journal of General Systems.
- Fuzzy c-means method for clustering microarray. Bioinformatics.
- Bagging to improve the accuracy of a clustering procedure. Bioinformatics.