On biological validity indices for soft clustering algorithms for gene expression data

doi:10.1016/j.csda.2010.12.003

Computational Statistics & Data Analysis

Volume 55, Issue 5, 1 May 2011, Pages 1969-1979

https://doi.org/10.1016/j.csda.2010.12.003

Abstract

Unsupervised clustering methods such as K-means, hierarchical clustering and fuzzy c-means have been widely applied to the analysis of gene expression data to identify biologically relevant groups of genes. Recent studies have suggested that the incorporation of biological information into validation methods to assess the quality of clustering results might be useful in facilitating biological and biomedical knowledge discoveries. In this study, we generalize two bio-validity indices, the biological homogeneity index and the biological stability index, to quantify the abilities of soft clustering algorithms such as fuzzy c-means and model-based clustering. The results of an evaluation of several existing soft clustering algorithms using simulated and real data sets indicate that the soft versions of the indices provide both better precision and better accuracy than the classical ones. The significance of the proposed indices is also discussed.

Introduction

Unsupervised clustering methods have been widely applied to the analysis of gene expression data to identify biologically relevant groups of genes. Algorithms such as K-means, hierarchical clustering and self-organizing maps (SOM) are popular hard clustering methods, which assign each gene to exactly one cluster. However, this restriction may not be appropriate for analyzing gene expression profiles because a single gene might be involved in multiple functional categories. In contrast, a soft clustering algorithm, such as fuzzy c-means, assigns a gene to multiple clusters according to its degrees of membership and thus gives more information on gene multi-functionality (Dembele and Kastner, 2003).

To assess the quality and reliability of the clusters produced by a clustering algorithm, a variety of cluster validity indices have been proposed, such as the Dunn index (Dunn, 1973), the adjusted Rand index (Hubert and Arabie, 1985), the silhouette width (Rousseeuw, 1987), the figure of merit (FOM) (Yeung et al., 2001) and Hennig’s stability index (Hennig, 2007) (all primarily intended for hard clusters), as well as the Xie–Beni index (Xie and Beni, 1991) and Qiu and Joe’s separation index (Qiu and Joe, 2006) (primarily for fuzzy clusters). Some of these indices are among the most popular statistical indices for gene expression data analysis. See Handl et al. (2005) for an overview of cluster validation measures for post-genomic data. On the other hand, recent studies have suggested that the incorporation of biological information, such as functional and/or curated annotations, into validation methods might be useful for supporting biological and biomedical discoveries. For example, Bolshakova et al. (2005) presented a knowledge-driven approach for assessing cluster validity based on similarities extracted from the gene ontology (GO). Park et al. (2005) used a genetic algorithm for fuzzy clustering and prior knowledge of experimental data for their evaluation. Datta and Datta (2006) made use of biological information along with gene expression data and proposed two indices, the biological homogeneity index (BHI) and the biological stability index (BSI), for the biological evaluation of clustering algorithms. In summary, these indices aim to identify clustering results with good statistical properties, such as compactness, separation, connectedness and stability, and also to provide more biologically relevant results.

For soft clustering algorithms, a common way to biologically evaluate the resulting class memberships is to select the class label with the highest membership value so that the existing indices can be applied. However, bio-indices that are designed specifically for hard clustering methods may not be sufficient when applied to the soft clustering of genes since some genes may belong to several statistical clusters. A biological evaluation index that considers the class membership is necessary for the correct evaluation of soft clustering algorithms. In the present study, we generalize the BHI and the BSI to quantify the abilities of soft clustering algorithms to produce biologically meaningful clusters using a reference set of functional classes. The indices investigated are the soft biological homogeneity index (SBHI) and the soft biological stability index (SBSI). To the best of our knowledge, no biological validation indices have been proposed thus far for soft clustering algorithms.
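The hardening step described above, in which each gene receives the label of its highest membership, can be sketched as follows. This is a minimal Python illustration rather than the paper's R code; the function name `harden` and the example membership matrix are hypothetical.

```python
def harden(memberships):
    """Return, for each gene, the index of the cluster with the largest membership."""
    labels = []
    for row in memberships:
        # max() returns the first maximal element, so ties go to the lowest cluster index
        labels.append(max(range(len(row)), key=lambda k: row[k]))
    return labels


# Hypothetical membership matrix: 3 genes x 2 clusters, each row sums to 1.
U = [[0.90, 0.10],
     [0.55, 0.45],
     [0.20, 0.80]]
labels = harden(U)
print(labels)
```

In this toy example the second gene is almost evenly split between the two clusters, yet after hardening it is treated exactly like the first gene. That loss of information is what motivates indices that use the memberships directly.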

We evaluated the performance of several existing soft clustering algorithms in R (R Development Core Team, 2006), including fuzzy c-means clustering (Bezdek, 1981), fuzzy c-shell clustering (Dave, 1996), model-based clustering (Fraley and Raftery, 2002) and consensus clustering (Monti et al., 2003), on two simulated and three gene expression data sets and identified the optimal algorithm for each number of clusters. A biological reference set for the annotated genes of the relevant species was obtained from the gene ontology database. The proposed indices are helpful for selecting the optimal algorithm from a class of soft clustering algorithms for the given data sets. We provide R code for computing these indices; for statistical validation, the function fclustIndex in the e1071 package (Dimitriadou et al., 2006) offers several measures for fuzzy clusters.

This paper is structured as follows. Section 2 introduces some soft clustering methods to be evaluated, which are available in R. The classical biological validation indices for hard clusters, the BHI and the BSI, are given in Section 3. Section 4 describes the generalized indices, the SBHI and the SBSI, for soft clustering algorithms and discusses their significance. Section 5 provides examples using two simulated and three gene expression data sets. We then conclude the paper in Section 6.

Soft clustering methods

In this section, we give a brief description of several existing soft clustering methods and their availability in R. These algorithms will be evaluated by the proposed indices. Let $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ be the data set to be analyzed, consisting of $n$ data points in $p$-dimensional space, and let $C = \{c_{1}, c_{2}, \dots, c_{K}\}$ be a set of $K$ cluster centers. Let $U = \{u_{ij}\}$ be an $n \times K$ fuzzy partition matrix, where $u_{ij}$ is the membership degree of point $x_{i}$ in cluster $j$. Let $m$ be a fuzziness index (often $m = 2$).
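To make the notation concrete, the sketch below computes the fuzzy partition matrix $U$ for fixed centers using the standard fuzzy c-means membership update, $u_{ij} = 1 / \sum_{k} (d_{ij}/d_{ik})^{2/(m-1)}$, where $d_{ij}$ is the Euclidean distance from $x_{i}$ to $c_{j}$. It is a Python illustration under the assumption of Euclidean distance, not the R implementations evaluated in the paper; the function name `fcm_memberships` and the toy data are hypothetical.

```python
import math


def fcm_memberships(X, C, m=2.0):
    """Standard fuzzy c-means membership update for fixed centers C (Euclidean distance)."""
    power = 2.0 / (m - 1.0)
    U = []
    for x in X:
        d = [math.dist(x, c) for c in C]
        if any(dk == 0.0 for dk in d):
            # The point coincides with one or more centers:
            # split full membership equally among those centers.
            zeros = [dk == 0.0 for dk in d]
            U.append([1.0 / sum(zeros) if z else 0.0 for z in zeros])
            continue
        U.append([1.0 / sum((dj / dk) ** power for dk in d) for dj in d])
    return U


# Hypothetical data: n = 3 points in p = 2 dimensions, K = 2 centers.
X = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
C = [(0.0, 0.0), (4.0, 0.0)]
U = fcm_memberships(X, C, m=2.0)
for row in U:
    print([round(u, 3) for u in row])
```

Each row of the resulting matrix sums to 1. Values of $m$ close to 1 push the memberships toward a hard partition, while larger values make them more diffuse.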

Biological evaluation indices for hard clustering methods

In this section, we describe the BHI and the BSI, both of which were originally presented in Datta and Datta (2006). Let $B = \{B_{1}, \dots, B_{F}\}$ be a set of $F$ functional classes that are not necessarily disjoint, and let $B^{i}$ be the functional class containing gene $i$. Note that a gene may be involved in multiple functional categories.
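The equations themselves are not reproduced in this excerpt. As a rough guide, the sketch below computes a homogeneity index in the spirit of the BHI of Datta and Datta (2006): for each cluster, the fraction of pairs of annotated genes that share at least one functional class, averaged over clusters. The exact normalization is an assumption here, and the function name `bhi` and the toy annotations are hypothetical; the code is Python, not the paper's R code.

```python
from itertools import combinations


def bhi(labels, annotations):
    """Average, over clusters, of the fraction of annotated gene pairs sharing a class.

    labels[i]      : cluster index of gene i (hard assignment)
    annotations[i] : set of functional classes of gene i (empty set if unannotated)
    """
    clusters = {}
    for gene, k in enumerate(labels):
        if annotations[gene]:  # only annotated genes enter the index
            clusters.setdefault(k, []).append(gene)
    K = len(set(labels))  # assumption: average over all clusters present in labels
    total = 0.0
    for genes in clusters.values():
        n_k = len(genes)
        if n_k < 2:
            continue  # no pairs to compare in this cluster
        agree = sum(1 for i, j in combinations(genes, 2)
                    if annotations[i] & annotations[j])
        total += agree / (n_k * (n_k - 1) / 2)
    return total / K


# Hypothetical example: 5 genes, 2 clusters, functional classes A, B, C.
labels = [0, 0, 0, 1, 1]
annotations = [{"A"}, {"A", "B"}, {"B"}, {"C"}, {"C"}]
print(bhi(labels, annotations))
```

In the toy example, two of the three gene pairs in the first cluster share a class and the single pair in the second cluster does, so the index is the average of 2/3 and 1.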

Biological evaluation indices for soft clustering methods

Based on the BHI and the BSI, we generalize Eqs. (7) and (8) to evaluate soft clustering algorithms, using the degrees of membership $u_{ik}$ as weights in the formulation. For hard clusters, $u_{ik} = 1$ if gene $i$ belongs to cluster $k$ and 0 otherwise. For simplicity, $m$ is omitted from $u_{ij}^{m}$ in the following.
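The generalized equations are not shown in this excerpt, so the following is only one plausible way to weight a pairwise homogeneity index by memberships, not necessarily the SBHI as defined in the paper. Each pair of annotated genes contributes to cluster $k$ with weight $u_{ik} u_{jk}$, and the weighted fraction of pairs that share a functional class is averaged over clusters. The function name `soft_bhi_sketch` and the data are hypothetical, and the code is Python rather than the paper's R code.

```python
from itertools import combinations


def soft_bhi_sketch(U, annotations):
    """Membership-weighted pairwise homogeneity (illustrative form only).

    U[i][k]        : membership degree of gene i in cluster k
    annotations[i] : set of functional classes of gene i (empty set if unannotated)
    """
    annotated = [i for i, a in enumerate(annotations) if a]
    K = len(U[0])
    total = 0.0
    for k in range(K):
        num = den = 0.0
        for i, j in combinations(annotated, 2):
            w = U[i][k] * U[j][k]  # pair weight in cluster k
            den += w
            if annotations[i] & annotations[j]:
                num += w
        if den > 0:
            total += num / den  # clusters with zero total weight contribute 0
    return total / K


annotations = [{"A"}, {"A", "B"}, {"B"}, {"C"}, {"C"}]

# Hard memberships (0/1): the weights simply select the pairs inside each cluster.
U_hard = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
print(soft_bhi_sketch(U_hard, annotations))

# Hypothetical soft memberships for the same genes.
U_soft = [[0.9, 0.1], [0.6, 0.4], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]]
print(soft_bhi_sketch(U_soft, annotations))
```

With 0/1 memberships the weights reduce to indicator functions of cluster membership, so the sketch returns the unweighted per-cluster fraction of agreeing pairs. This matches the remark above that hard clusters are the special case $u_{ik} \in \{0, 1\}$.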

Examples

Several soft clustering algorithms freely available in R were evaluated on two simulated and three biological data sets for varying numbers of clusters. The algorithms evaluated include fuzzy c-means (cmeans and fanny), fuzzy c-shell clustering (cshell), model-based clustering (Mclust), consensus clustering (cl_bag) and random clustering (random). One minus the Pearson correlation (C) was used as the dissimilarity measure for cmeans and fanny. The standard Euclidean distance (E) between
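The correlation-based dissimilarity mentioned above, one minus the Pearson correlation, can be computed as in the following sketch. It is a plain Python illustration with hypothetical expression profiles, not the R code used in the study.

```python
import math


def pearson_dissimilarity(x, y):
    """Return 1 - r, where r is the Pearson correlation of profiles x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


# Hypothetical expression profiles over three conditions.
g1 = [1.0, 2.0, 3.0]
g2 = [2.0, 4.0, 6.0]   # same shape as g1, different scale
g3 = [3.0, 2.0, 1.0]   # opposite trend to g1
print(pearson_dissimilarity(g1, g2), pearson_dissimilarity(g1, g3))
```

The measure ranges from 0 for perfectly correlated profiles to 2 for perfectly anti-correlated ones. It ignores differences in scale and offset, which is why it is often preferred to Euclidean distance when the shape of an expression profile matters more than its magnitude.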

Conclusion

Existing biological knowledge, such as the GO database, can assist in the cluster validation process. In this study, we generalized the biological homogeneity index and the biological stability index by considering the resulting class memberships to formulate a soft biological homogeneity index and a soft biological stability index for evaluating soft clustering algorithms on gene expression data. These fuzzified indices provide better and more precise biological performance measures than the

Acknowledgements

The author thanks the associate editor and the referees for their valuable comments and suggestions. This research was supported by grants from the National Science Council of Taiwan, ROC (NSC 97-2118-M-032-001).

References (28)

V. Bhattacherjee et al. Neural crest and mesoderm lineage-dependent gene expression in orofacial development. Differentiation (2007)

R.J. Cho et al. A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell (1998)

C. Hennig. Cluster-wise assessment of cluster stability. Computational Statistics & Data Analysis (2007)

W. Qiu et al. Separation index and partial membership for clustering. Computational Statistics & Data Analysis (2006)

P. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics (1987)

M.C. Abba et al. Transcriptomic changes in human breast cancer progression as determined by serial analysis of gene expression. Breast Cancer Research (2004)

J.C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms (1981)

N. Bolshakova et al. A knowledge-driven approach to cluster validity assessment. Bioinformatics (2005)

G. Brock et al. clValid: an R package for cluster validation. Journal of Statistical Software (2008)

S. Datta et al. Methods for evaluating clustering algorithms for gene expression data using a reference set of functional classes. BMC Bioinformatics (2006)

R.N. Dave. Fuzzy shell-clustering and applications to circle detection in digital images. International Journal of General Systems (1996)

D. Dembele et al. Fuzzy c-means method for clustering microarray. Bioinformatics (2003)

Dimitriadou, E., Hornik, K., Leisch, F., Meyer, D., Weingessel, A., 2006. e1071: Misc Functions of the Department of...

S. Dudoit et al. Bagging to improve the accuracy of a clustering procedure. Bioinformatics (2003)
