On biological validity indices for soft clustering algorithms for gene expression data

https://doi.org/10.1016/j.csda.2010.12.003

Abstract

Unsupervised clustering methods such as K-means, hierarchical clustering and fuzzy c-means have been widely applied to the analysis of gene expression data to identify biologically relevant groups of genes. Recent studies have suggested that the incorporation of biological information into validation methods to assess the quality of clustering results might be useful in facilitating biological and biomedical knowledge discoveries. In this study, we generalize two bio-validity indices, the biological homogeneity index and the biological stability index, to quantify the abilities of soft clustering algorithms such as fuzzy c-means and model-based clustering. The results of an evaluation of several existing soft clustering algorithms using simulated and real data sets indicate that the soft versions of the indices provide both better precision and better accuracy than the classical ones. The significance of the proposed indices is also discussed.

Introduction

Unsupervised clustering methods have been widely applied to the analysis of gene expression data to identify biologically relevant groups of genes. K-means, hierarchical clustering and self-organizing maps (SOM) are popular hard clustering methods, which assign each gene to exactly one cluster. This restriction may be inappropriate for gene expression profiles, because a single gene can be involved in multiple functional categories. In contrast, a soft clustering algorithm such as fuzzy c-means assigns each gene to multiple clusters according to its degrees of membership and thus gives more information on gene multi-functionality (Dembele and Kastner, 2003).

To assess the quality and reliability of the clusters produced by a clustering algorithm, a variety of cluster validity indices have been proposed such as the Dunn index (Dunn, 1973), the adjusted Rand index (Hubert and Arabie, 1985), the silhouette width (Rousseeuw, 1987), the figure of merit (FOM) (Yeung et al., 2001), Hennig’s stability index (Hennig, 2007) (all primarily intended for hard clusters), and the Xie–Beni index (Xie and Beni, 1991) and Qiu and Joe’s separation index (Qiu and Joe, 2006) (primarily for fuzzy clusters). Some of these indices are the most popular statistical indices for gene expression data analysis. Refer to Handl et al. (2005) for an overview of cluster validation measures for post-genomic data. On the other hand, recent studies have suggested that the incorporation of biological information, such as functional and/or curated annotations, into validation methods might be useful for supporting biological and biomedical discoveries. For example, Bolshakova et al. (2005) presented a knowledge-driven approach for assessing the cluster validity based on similarities extracted from gene ontology (GO). Park et al. (2005) used a genetic algorithm for fuzzy clustering and prior knowledge of experimental data for their evaluation. Datta and Datta (2006) made use of the biological information along with gene expression data and proposed two indices, the biological homogeneity index (BHI) and the biological stability index (BSI), for the biological evaluation of clustering algorithms. In summary, these indices attempt to produce clustering results with good statistical properties, such as compactness, well-separatedness, connectedness and stability, and also attempt to provide more biologically relevant results.

For soft clustering algorithms, a common way to biologically evaluate the resulting class memberships is to select the class label with the highest membership value so that the existing indices can be applied. However, bio-indices that are designed specifically for hard clustering methods may not be sufficient when applied to the soft clustering of genes since some genes may belong to several statistical clusters. A biological evaluation index that considers the class membership is necessary for the correct evaluation of soft clustering algorithms. In the present study, we generalize the BHI and the BSI to quantify the abilities of soft clustering algorithms to produce biologically meaningful clusters using a reference set of functional classes. The indices investigated are the soft biological homogeneity index (SBHI) and the soft biological stability index (SBSI). To the best of our knowledge, no biological validation indices have been proposed thus far for soft clustering algorithms.
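
The hardening step described above, in which each gene receives the label of its highest membership, can be sketched as follows. This is a minimal illustration in Python; the function name `harden` and the toy membership matrix are assumptions made for this example and do not come from the paper.

```python
def harden(memberships):
    """Return, for each gene (row), the index of the cluster with the largest membership.

    Ties are resolved in favor of the lowest cluster index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


# Hypothetical fuzzy partition matrix: 3 genes (rows) x 3 clusters (columns).
U = [
    [0.70, 0.20, 0.10],
    [0.45, 0.50, 0.05],
    [0.34, 0.33, 0.33],
]
labels = harden(U)
print(labels)
```

The second and third genes show what hardening discards: memberships of 0.45 versus 0.50, or a near-uniform row, collapse into a single label, which is the loss of information that motivates indices that use the memberships directly.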

We evaluated the performance of several existing soft clustering algorithms in R (R Development Core Team, 2006), including fuzzy c-means clustering (Bezdek, 1981), fuzzy c-shell clustering (Dave, 1996), model-based clustering (Fraley and Raftery, 2002) and consensus clustering (Monti et al., 2003), on two simulated and three gene expression data sets and identified the optimal algorithm for each number of clusters. A biological reference set for the annotated genes of the relevant species was obtained from the gene ontology database. The proposed indices are helpful for selecting the optimal algorithm from a class of soft clustering algorithms for a given data set. We provide R code for computing these indices; the function fclustIndex in the e1071 package (Dimitriadou et al., 2006), by contrast, provides several statistical validation measures for fuzzy clusters.

This paper is structured as follows. Section 2 introduces some soft clustering methods to be evaluated, which are available in R. The classical biological validation indices for hard clusters, the BHI and the BSI, are given in Section 3. Section 4 describes the generalized indices, the SBHI and the SBSI, for soft clustering algorithms and discusses their significance. Section 5 provides examples using two simulated and three gene expression data sets. We then conclude the paper in Section 6.

Soft clustering methods

In this section, we give a brief description of several existing soft clustering methods and their availability in R. These algorithms will be evaluated by the proposed indices. Let $X=\{x_1,x_2,\ldots,x_n\}$ be the data set to be analyzed, consisting of $n$ data points in $p$-dimensional space, and let $C=\{c_1,c_2,\ldots,c_K\}$ be a set of $K$ cluster centers. Let $U=\{u_{ij}\}$ be an $n\times K$ fuzzy partition matrix, where $u_{ij}$ is the membership degree of point $x_i$ in cluster $j$. Let $m$ be a fuzziness index (often $m=2$).
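
To make the notation concrete, the sketch below computes a fuzzy partition matrix from fixed cluster centers using the textbook fuzzy c-means membership update, u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)), with Euclidean distances. It is written in Python for illustration only; the function name `fcm_memberships` and the toy data are assumptions, and the R implementations evaluated in the paper may differ in distance measure and other details.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Textbook fuzzy c-means membership update for fixed centers.

    points  : list of n coordinate tuples (the data set X)
    centers : list of K coordinate tuples (the cluster centers C)
    m       : fuzziness index, must be greater than 1
    Returns an n x K list of lists (the fuzzy partition matrix U).
    """
    exponent = 2.0 / (m - 1.0)
    U = []
    for x in points:
        dists = [math.dist(x, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers: share the full
            # membership equally among those centers to avoid dividing by zero.
            hits = [d == 0.0 for d in dists]
            row = [1.0 / sum(hits) if h else 0.0 for h in hits]
        else:
            row = [1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
                   for d_j in dists]
        U.append(row)
    return U


# Hypothetical one-dimensional example with K = 2 centers.
points = [(0.0,), (1.0,), (4.0,)]
centers = [(0.0,), (4.0,)]
U = fcm_memberships(points, centers, m=2.0)
for row in U:
    print([round(u, 3) for u in row])
```

Each row of U sums to one, and a larger m pushes the memberships toward the uniform value 1/K, which is why m is called a fuzziness index.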

Biological evaluation indices for hard clustering methods

In this section, we describe the BHI and the BSI, both of which were originally presented in Datta and Datta (2006). Let $B=\{B_1,\ldots,B_F\}$ be a set of $F$ functional classes that are not necessarily disjoint, and let $B_i$ be the functional class containing gene $i$. Note that a gene may be involved in multiple functional categories.
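
As an illustration of how a hard bio-validity index uses the functional classes, the following Python sketch computes a homogeneity score in the form in which the BHI of Datta and Datta (2006) is commonly stated: for each cluster, the proportion of ordered pairs of annotated genes that share one or more functional classes, averaged over the clusters. The equations are not reproduced in this excerpt, so the exact normalization here should be read as an assumption; the function name `bhi` and the toy data are hypothetical.

```python
from itertools import permutations


def bhi(labels, annotations):
    """Homogeneity of hard clusters with respect to functional classes.

    labels      : dict mapping gene -> cluster id
    annotations : dict mapping gene -> set of functional classes
                  (unannotated genes may be missing or map to an empty set)
    """
    clusters = {}
    for gene, k in labels.items():
        clusters.setdefault(k, []).append(gene)

    total = 0.0
    for genes in clusters.values():
        annotated = [g for g in genes if annotations.get(g)]
        n_k = len(annotated)
        if n_k < 2:
            continue  # no pairs to compare; the cluster contributes zero
        # Count ordered pairs (i, j), i != j, that share a functional class.
        matches = sum(1 for g, h in permutations(annotated, 2)
                      if annotations[g] & annotations[h])
        total += matches / (n_k * (n_k - 1))
    return total / len(clusters)


# Hypothetical example: five genes, two clusters, three functional classes.
labels = {"g1": 0, "g2": 0, "g3": 0, "g4": 1, "g5": 1}
annotations = {
    "g1": {"A"},
    "g2": {"A", "B"},
    "g3": {"C"},
    "g4": {"B"},
    "g5": {"B"},
}
score = bhi(labels, annotations)
print(round(score, 4))
```

Because the classes need not be disjoint, two genes count as a match when their class sets overlap in any class, not only when the sets are identical.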

Biological evaluation indices for soft clustering methods

Based on the BHI and the BSI, we generalize Eqs. (7) and (8) to evaluate soft clustering algorithms, using the membership degrees $u_{ik}$ as weights in the formulation. For hard clusters, $u_{ik}=1$ if gene $i$ belongs to cluster $k$ and $0$ otherwise. For simplicity, the exponent $m$ is omitted from $u_{ij}^{m}$ in the following.
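
The idea of using membership degrees as weights can be illustrated with the Python sketch below. It is not the SBHI as defined in the paper, whose equations are not reproduced in this excerpt; it is one plausible membership-weighted homogeneity score, assumed here purely for illustration, in which each gene pair in cluster k is weighted by the product of the two memberships in that cluster. The function name `weighted_homogeneity` and all data are hypothetical.

```python
def weighted_homogeneity(U, annotations):
    """Illustrative membership-weighted homogeneity (an assumption, not the SBHI).

    U           : n x K list of lists of membership degrees
    annotations : list of n sets of functional classes (empty set = unannotated)
    """
    n, K = len(U), len(U[0])
    total = 0.0
    for k in range(K):
        num = den = 0.0
        for i in range(n):
            for j in range(n):
                if i == j or not annotations[i] or not annotations[j]:
                    continue
                w = U[i][k] * U[j][k]  # pair weight in cluster k
                den += w
                if annotations[i] & annotations[j]:
                    num += w
        if den > 0:
            total += num / den
    return total / K


annotations = [{"A"}, {"A", "B"}, {"C"}, {"B"}, {"B"}]

# With 0/1 memberships the score reduces to the hard pairwise proportion.
U_hard = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
hard_score = weighted_homogeneity(U_hard, annotations)

# Hypothetical soft memberships for the same five genes.
U_soft = [[0.9, 0.1], [0.6, 0.4], [0.8, 0.2], [0.1, 0.9], [0.3, 0.7]]
soft_score = weighted_homogeneity(U_soft, annotations)

print(round(hard_score, 4), round(soft_score, 4))
```

The check on `U_hard` mirrors the remark above that hard clusters are the special case in which every membership is 0 or 1, so a weighted index of this kind should agree with its hard counterpart on such input.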

Examples

Several soft clustering algorithms freely available in R were evaluated for two simulated and three biological data sets for varying numbers of clusters. The algorithms evaluated include fuzzy c-means (cmeans and fanny), fuzzy c-shell clustering (cshell), model-based clustering (Mclust), consensus clustering (cl_bag) and random clustering (random). The one-minus Pearson correlations (C) were used as a dissimilarity measure for cmeans and fanny. The standard Euclidean distance (E) between

Conclusion

Existing biological knowledge, such as the GO database, can assist in the cluster validation process. In this study, we generalized the biological homogeneity index and the biological stability index by considering the resulting class memberships to formulate a soft biological homogeneity index and a soft biological stability index for evaluating soft clustering algorithms on gene expression data. These fuzzified indices provide better and more precise biological performance measures than the

Acknowledgements

The author thanks the associate editor and the referees for their valuable comments and suggestions. This research was supported by grants from the National Science Council of Taiwan, ROC (NSC 97-2118-M-032-001).

References (28)

  • R.N. Dave. Fuzzy shell-clustering and applications to circle detection in digital images. International Journal of General Systems (1996).
  • D. Dembele et al. Fuzzy c-means method for clustering microarray. Bioinformatics (2003).
  • Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A., 2006. e1071: Misc Functions of the Department of...
  • S. Dudoit et al. Bagging to improve the accuracy of a clustering procedure. Bioinformatics (2003).