Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Identification of Single- and Multiple-Class Specific Signature Genes from Gene Expression Profiles by Group Marker Index

Abstract

Informative genes from microarray data can be used to construct prediction model and investigate biological mechanisms. Differentially expressed genes, the main targets of most gene selection methods, can be classified as single- and multiple-class specific signature genes. Here, we present a novel gene selection algorithm based on a Group Marker Index (GMI), which is intuitive, of low-computational complexity, and efficient in identification of both types of genes. Most gene selection methods identify only single-class specific signature genes and cannot identify multiple-class specific signature genes easily. Our algorithm can detect de novo certain conditions of multiple-class specificity of a gene and makes use of a novel non-parametric indicator to assess the discrimination ability between classes. Our method is effective even when the sample size is small as well as when the class sizes are significantly different. To compare the effectiveness and robustness we formulate an intuitive template-based method and use four well-known datasets. We demonstrate that our algorithm outperforms the template-based method in difficult cases with unbalanced distribution. Moreover, the multiple-class specific genes are good biomarkers and play important roles in biological pathways. Our literature survey supports that the proposed method identifies unique multiple-class specific marker genes (not reported earlier to be related to cancer) in the Central Nervous System data. It also discovers unique biomarkers indicating the intrinsic difference between subtypes of lung cancer. We also associate the pathway information with the multiple-class specific signature genes and cross-reference to published studies. We find that the identified genes participate in the pathways directly involved in cancer development in leukemia data. Our method gives a promising way to find genes that can involve in pathways of multiple diseases and hence opens up the possibility of using an existing drug on other diseases as well as designing a single drug for multiple diseases.

Introduction

Gene selection/biomarker identification methods can be classified as linear and nonlinear methods [1]. A linear method assumes that good biomarkers for a class will be highly expressed (or unexpressed) for that class and unexpressed (or highly expressed) for the rest of the classes [2], [3]. Nonlinear methods identify biomarkers exploiting both linear and nonlinear interactions between classes and genes [4], [5]. Most methods are linear in nature and there are a few non-linear approaches. Nonlinear approaches can discover a small set of discriminatory genes that are good enough for diagnosis of a set of diseases. But the relationship of biomarkers identified by nonlinear methods with different classes may not be easily visualized. Saeys et al. [6] reviewed various methods of biomarker identification and suggested three categories for the existing methodologies. The first one is the filter method category, which ranks genes independently of the classifier that is used. Many linear methods, either univariate or multivariate, e.g., “SAM” [7], “shrinkage t[8], “correlation-adjusted t[9] belong to this category. Most of these methods are essentially based on Student's t-test or its modified/adapted forms. In some cases, in conjunction with the t-test, authors have used additional procedures to account for the high-dimensional characteristic of microarray data. In addition to t-test, SAM also uses Wilcoxon's Rank Sum test. In this category, other parametric and non-parametric tests have also been used [10], [11]. The second one is the wrapper method category, which selects genes according to the predictive performance of the associated classifier [12]-[14]. The final category of embedded methods assigns weights to the importance of genes by making use of the internal parameters of the classification model – this is an integrated approach where the feature weighting/selection and classifier design are done simultaneously. The well-known SVM-RFE [4], other SVM-based methods, e.g., MMC-RFE [15], MSVM-RFE [16], SVM-RCE [17], and SVM-RNE [18] all belong to this category. Also, some neural network-based methods, e.g., online FSMLP [5], belong to the same category. The advantages of these wrapper and embedded methods are that they can capture non-linear interactions between genes as well as interaction between genes and the diseases. As a result, such marker genes usually provide a better predictive performance.

Most existing gene selection methods (particularly linear methods) can identify only single-class specific signature genes (i.e., genes that are expressed for one class and unexpressed for the rest of the classes). However, there could be genes which are expressed for a subset of classes (say, for a subgroup of cancers). Such genes are biologically informative genes. Unfortunately, we have failed to locate any study dealing with this important issue. None of the methods discussed earlier can be used easily to identify “multiple-class specific” marker genes. In principle, it may be possible to use methods in the filter category, but it will involve extensive computation. For example, due to the two-class discriminant nature of statistics, such as t-statistic (that is frequently used), further strategies, e.g., one-versus-one (OVO), one-versus-all (OVA), are required to apply these methods in identification of multiple-class specific marker genes. Consequently, to find useful biomarkers, a time-consuming procedure considering all possible combinations of classes has to be performed. Pavlidis and Nobel have proposed an ANOVA and template matching based approach [19] to identify such multiple-class specific biomarkers. But as explained above, to find genes with multiple-class specific signatures, we need to try all possible combinations of classes, which would demand considerable computation. Also, another interesting but time consuming procedure has been proposed in [20] to assign significance to statistically defined expression patterns.

On the other hand, for methods in the embedded and wrapper categories, although we can get better predictive performance, the interactions between the biomarkers and the diseases may not be easy to interpret/understand due to the non-linear nature of interactions. Moreover, to get multiple-class specific biomarkers, we shall still require some post-processing, which may not be easy. Finally, there may be some biomarkers for some specific groups of diseases, but such a method may not recognize/identify those because such methods are driven by minimization of classification error. Hence, there is a need for developing methods/algorithms that can find genes with multiple-class specific signatures.

Here we propose a novel gene selection index, Group Marker Index (GMI), which can identify de novo in a single process both single- and multiple-class specific signature genes (both are called group specific genes). It is computationally efficient in the sense that it does not require computation of the index for all possible subsets of classes. For a K-class data set, GMI is evaluated K-1 times rather than for 2K-2 combinations. GMI is a distribution-free method, which shares the advantage of non-parametric methods and is not influenced by the lack of knowledge about the distribution of data. Furthermore, we use GMI with a repeated random sampling procedure to select candidate marker genes in a more reliable manner. Here a permutation procedure is used to assess the significance of selected marker genes. We have used four multiple-class microarray data sets, which are Small Round Blue Cell Tumors (SRBCT) [21], Leukemia [22], Central Nervous System (CNS) [23], and Lung Cancer [24], to demonstrate the effectiveness of the proposed method. We demonstrate that genes identified by the proposed scheme participate in several important biological processes. Scatter-plots of the identified group specific genes for these data sets also exhibit good discriminating power among classes. Although we could not find any method in the literature addressing this problem, just for the sake of comparison, we have proposed and used an algorithm using a template-based gene selection scheme.

Results

Significance of genes selected by using GMI

In this study, we develop a gene evaluation index named “Group Marker Index (GMI)”, to select biologically significant genes for both two- and multiple-class cancer discrimination problems. Given expression profiles on K different types of cancers (diseases), our GMI-based algorithm tries to discover subsets of the cancers/diseases, if present, that can be discriminated from the remaining ones using gene expressions, as well as, the discriminatory genes. If a gene is highly expressed for a subset containing n classes (n = 1, 2, …, K-1) compared to the remaining (K-n) classes, then we call that gene a level-n discriminatory gene. This set of cancer classes with higher expression values is called the “upper group”. On the other hand, the set of remaining classes with lower expression values is called the “lower group”. The GMI algorithm finds, all level-n (n = 1, 2, …, K-1) genes, if exist. The detailed description of the algorithm is provided in the Materials and Methods section.

The quality of genes selected by using GMI is demonstrated here in two significant ways. First, for each of the four data sets, we present visual assessments of the quality of the top most level-n gene for every level-n selected by GMI. We use a scatter-plot to show the distribution of gene expression values for the top most level-n gene. For the scatter-plot the y-axis expresses the observed gene expression values (normalized in [0,1]) and the x-axis indicates the number and identification of samples in a data set. The samples in different classes are represented by different symbols and colors, which help illustrate the discriminating power of (each) individual GMI-selected gene. For example, each sub-panel in Fig. 1 and Figs. S1, S2, S3 displays a top most level-n gene for different level n in the different data sets. As expected, each such gene appears with high gene expression values in the samples from n classes (i.e., the upper group, infra at Computation of GMI), but with low gene expression values in the samples of the remaining classes/subgroups (i.e., the lower group). Also, every top most level-n gene demonstrates good separation and low overlap between samples belonging to upper and lower groups.

thumbnail
Figure 1. Scatter-plots of the top most gene of each level in the SRBCT data set.

Panels (a), (b) and (c) are the scatter-plots of the top most gene of level-1, level-2, and level-3, respectively. The top most genes are WAS (236282), PTPN12 (774502) and GSTA4 (504791), respectively. There are four classes in the SRBCT data set: Ewing sarcomas (EWS), Burkitt lymphomas (BL), neuroblastomas (NB), and rhabdomyosarcomas (RMS).

https://doi.org/10.1371/journal.pone.0024259.g001

In addition, we have further tabulated (Table 1 and Tables S1, S2, S3) the following basic information for the top 10 genes, which are individually selected using GMI from each level of discrimination: probe name, official gene symbol, class-labels in upper group and lower group, frequency of appearance of the gene in the list of top 10 genes, averaged GMI value, p-value and q-value obtained from the permutation test mentioned in the Materials and Methods section. These tables further provide us with several interesting conclusions: (a) We find that all GMI-selected genes have very low p- and q-values (in many cases it is so low that it is represented as zero) suggesting that these genes carry statistically significant multiple-class specific signatures. (b) For a given level of discrimination, the genes with quite high GMI values and appearing with high frequencies are considered better group-specific biomarkers. (c) We use the GMI value to represent a measure of discrimination for a gene between upper and lower groups. Thus, a gene with a higher GMI value implies its higher discriminating power between upper and lower groups. This information can be used for comparison of discriminating power of genes at different levels in the same data set. As demonstrated by the GMI values in Table 1 for the SRBCT data set, the discriminating power of level-2 genes is weaker than that of level-3 genes, and the discriminating power of level-3 genes is weaker than that of level-1 genes. Also as shown by the GMI values set forth in Table S3 for the Lung Cancer data set, the discriminating power of level-3 and level-4 genes is weaker than that of level-1 and level-2 genes. In addition, such phenomena can also be validated by the scatter-plot of the top most gene of each level for all data sets (Fig. 1 and Figs. S1, S2, S3). For example, WAS, the top most level-1 gene in the SRBCT data set, has a GMI value of 2.98 which is higher than the GMI value, 1.02, of PTPN12, the top most gene of level-2 in the same data set. From Fig. 1, we can observe that the discriminating ability of the WAS gene is much stronger than that of the PTPN12 gene because the WAS gene has no overlapped sample between upper and lower groups, whereas the PTPN12 gene has some overlapped samples. Hence, the visual illustration of the top most genes is consistent with the corresponding GMI values. (d) The GMI values can provide an objective assessment of the discriminating power even between data sets. For example, VAMP2, the top most level-1 gene in the Lung Cancer data set, has a GMI value of 7.19, showing discriminating power which appears even stronger than WAS, the top most level-1 gene in the SRBCT data set. Figure 1 and Figure S3 also reveal the stronger discriminating power of VAMP2 over WAS because VAMP2 also has no overlapped samples between upper and lower groups, and gets a bigger numerator for computing GMI (a higher value of the closest separation between a class in the upper group and a class in the lower group; i.e., the difference between the mean values of classes Cn(s) and Cn+1(s), or μSep  =  μn(s) – μn+1(s) in Eq. (1) mentioned in the Materials and Methods section). Our results demonstrate that GMI is very effective in identifying group-specific marker genes with clear discriminating power and we could obtain a robust list of candidate genes by conducting the repeated random splitting procedure.

thumbnail
Table 1. Summary of top 10 genes of each level selected by GMI in the SRBCT data set.

https://doi.org/10.1371/journal.pone.0024259.t001

In addition, the conventional biomarkers for cancers are usually the level-1 genes with single tissue specific expression patterns. The protein products of those marker genes are usually receptors or proteins expressed in the cell surface. GMI can also identify the known biomarkers. For example, FGFR4, the level-1 gene highly expressed in rhabdomyosarcomas (RMS) class, is ranked within the top 10 level-1 genes in the SRBCT data set (Table 1). Multiple studies have reported that FGFR4 is highly expressed in RMS [25]-[27] and the mRNA expression level of FGFR4 is correlated with its protein level [25], [26], [28]. Furthermore, it has been shown that mutationally activated FGFR4 acts as an oncogene [29]. Another level-1 gene ranked within the top 10 for the SRBCT data set, CD99, is highly expressed in Ewing sarcomas (EWS) class. It has been reported that CD99 is highly expressed in all EWS and the engagement of CD99 with anti-CD99 monoclonal antibodies would induce massive apoptosis as well as reduce malignant potential of EWS cells [30]. The deletion of CD99 expression in human EWS cell lines would reduce their abilities of tumorigenesis and metastasis [31]. Moreover, the engagement of CD99 improves the efficiency of the conventional chemotherapeutic agents and reduces tumor growth along with a significant delay of metastasis [32]. These are some of the examples to demonstrate that GMI is capable of identifying the known biomarkers as well as the special type of group biomarkers.

Comparison with template-based method

For a fair comparison of our GMI method with the existing methods for similar purpose, we have selected the most widely used template-matching based method and adapted it using the same iterative procedure as followed for GMI. We call this method as the template-based method (TBM). The detailed steps of the template-based method are described in the Materials and Methods section. First, we have used TBM to select group specific genes in the SRBCT and CNS data sets. Although, for the CNS data set, GMI can identify good level-2 discriminatory genes, it cannot identify good level-2 discriminatory genes for the SRBCT data set (please refer to Table 1 and Table S2). So for the SRBCT data set we want to check if there are useful level-2 discriminatory genes and GMI cannot identify those. For the CNS data set, we want to check whether GMI has already found good level-2 and level-3 discriminatory genes. In other words, we want to check if TBM can identify better genes. For each case, we have compared the top 10 genes identified by GMI and TBM.

Comparing the top 10 level-2 genes in the SRBCT data set (Table 2), we find that both GMI and TBM select PTPN12 (774502) as the top most gene. Therefore, there may be no better level-2 gene in the SRBCT data set. To further check whether GMI miss any good level-2 gene, we focus on those genes which are identified by TBM but not by GMI. Between the two gene lists, there are four common genes. With a careful inspection of scatter-plots of the TBM-selected unique genes, we find that for each of such genes there is a substantial overlap between the upper and the lower groups. For a few of the GMI-selected unique genes, a similar situation arises. Thus to make a better objective evaluation of the two lists, for each gene in Table 2, we have performed a leave-one-out cross validation (LOOCV) with the nearest neighbor classifier (NNC), which enables us to assess the discriminating power of each gene for the two groups of diseases. The resultant accuracies are tabulated in the last column of Table 2. On the average, the GMI-selected unique genes are superior to the TBM-selected unique genes. One GMI-selected unique gene, Probe ID  =  365826 (ranked the second by GMI) achieves the highest accuracy of 0.9048! On the other hand, Probe ID  =  841620, which is ranked the second and selected only by TBM, achieves the least accuracy of 0.6508! This suggests that GMI selected genes are better discriminator; however, as we have mentioned earlier, finding genes suitable for designing classifiers is not the objective of this study.

thumbnail
Table 2. The comparison of top 10 level-2 genes selected by GMI and TBM in the SRBCT data set.

https://doi.org/10.1371/journal.pone.0024259.t002

Table S4 reveals that for the CNS data set in the lists of top 10 level-2 genes identified by GMI and TBM there are 8 common genes; while for the level-3 genes there are 5 common genes (Table S5). Most of the genes common to both methods are on the top of the TBM gene list. These findings imply that there may not be better level-2 and level-3 discriminatory genes in the CNS data set. Also, the higher agreement between the gene lists identified by GMI and TBM in this data set with more balanced distribution of number of samples over classes may indicate that difference in sample sizes between classes can affect the gene selection.

To investigate the effect of variation of sample sizes between classes, we have compared the gene lists produced by GMI and TBM on Leukemia and Lung Cancer data sets. All classes in the Leukemia data set have comparable sample size. On the other hand, the Lung Cancer data set has one class with a relatively large sample size. We compare the top 10 genes for these two data sets. For the Leukemia data set, considering the level-2 genes we find that seven of the top 10 genes are common in the two lists. Moreover, both lists have the same genes at the top (Table S6). This high percentage of common genes in the two lists might be taken as an indicator that TBM would identify genes similar to those identified by GMI in cases where there is not much variation between sample sizes from different classes.

For the Lung Cancer data set, we consider both level-2 and level-3 discriminatory genes. In the case of level-2 genes, we have three common genes in the top 10 genes identified by GMI and TBM (Table S7). In order to assess the quality of the genes identified by GMI and TBM, we have compared the scatter-plots of those genes, which are identified by only one of the methods. A careful inspection of the scatter-plots suggests that the genes identified only by TBM are not better genes than those identified only by GMI for the purpose of discrimination (see File S1). Furthermore, TAGLN3 (32650_at), which is ranked by GMI as the top most level-2 gene with the highest frequency of selection, is not in the top 10 gene list selected by TBM. But the scatter-plot of TAGLN3 (Figure S3b) reveals that it is definitely a good level-2 gene. Concordant results are also revealed by the single gene LOOCV accuracy as depicted in Table S7, where TAGLN3 achieves the perfect accuracy. While comparing the level-3 discriminatory genes, we have found that there are three common genes between the lists of top 10 genes identified by GMI and TBM (Table S8). Four of the top 10 genes (2nd, 3rd, 5th and 10th) selected by TBM appear much like level-2 genes (see Figs. 2a-d). The top sixth and eighth genes selected by TBM appear more like level-1 genes (see Figs. 2ef). Unlike genes selected by TBM, the scatter-plots of the top genes identified by GMI reveal that these genes are good level-3 discriminatory genes. On the average, the single gene LOOCV accuracies of GMI-selected unique genes are higher than the accuracies of TBM-selected unique genes.

thumbnail
Figure 2. The level-2-like and level-1-like genes ranked within top 10 level-3 genes by template-based method in the Lung Cancer data set.

Panels (a), (b), (c) and (d) are the scatter-plots of the level-2-like genes. Panels (e) and (f) are the scatter-plots of the level-1-like genes.

https://doi.org/10.1371/journal.pone.0024259.g002

We find that when the class sizes are balanced, more genes are found common in the top 10 level-n genes produced by GMI and TBM. When the class sizes are widely different, the lists of top 10 level-n genes are significantly different. And GMI is found to identify better discriminatory genes. A natural question arises, why? A possible reason for this may be the fact that correlation value for a gene with the designated template (used in TBM) is quite sensitive to the relative sizes of different classes in the training data. For example, even if the sizes of different classes remain the same, but the number of samples from one class keeps on increasing, the correlation value exhibit a monotonic behavior. This is illustrated in Fig. 3. Suppose in a K-class problem, class 1 has n1 samples, class 2 has n2 samples and the remaining classes together have n3 samples. Suppose for a gene g, all samples from class 1 and 2 are highly expressed (with gene expression value of 1), while for the other classes the gene g is unexpressed (i.e., the gene expression value is zero). In this case, the correlation for the gene g with the ideal vector for class 1 (the ideal vector will have expression of 1 for class 1 and 0 for all other classes) will be . It can be easily seen that as n3 increases with more and more unexpressed samples keeping the nature of the n1 and n2 samples from class 1 and 2 unaltered, the correlation goes to . The filled circle in Fig. 3 illustrates this when n1 = 60, n2 = 15 and n3 varies from 20 to 300. Note that, when n3 = 20, the correlation value is 0.68 but that increases to 0.85 when n3 becomes 140, although neither the expression profiles for class 1 and class 2 change nor the expression profile of the remaining classes changes, and for a multiple-class data 140 is not a big number. Thus, the template-based method, which selects genes based on correlation value, sometimes is not sensitive enough to reflect the desired behavior of samples. This will lead to improper ranking.

thumbnail
Figure 3. Effect of sample size on Pearson's correlation coefficient values.

https://doi.org/10.1371/journal.pone.0024259.g003

Following the same experimental design, we add one extra class (class (K+1)) with only five samples, which is either up-regulated or down-regulated. The template is so set that samples in class 1 and class (K+1) are highly expressed while unexpressed in the remaining classes. When class (K+1) is up-regulated, the tendency of the resultant correlation values (shown by the symbol “+” in Fig. 3) is almost the same as that of the K-class case. In case, class (K+1) is down-regulated, the nature of the resultant correlation values (shown by triangles in Fig. 3) is still similar to the previous two cases with correlation values in a slightly lower region. In these two (K+1)-class cases, most samples are unexpressed, which are contributed by the pool (class 3 to class K) with n3 samples. Therefore, it has practically no effect when we set class (K+1) up-regulated. On the other hand, there is a little influence when class (K+1) is down-regulated. However, as the number of samples increases, the impact becomes less and less.

The level 2-like genes selected by TBM in the Lung Cancer data set (Figs. 2a–d) are good examples to show that this effect indeed happens in real data. All of those four genes are selected based on the template with high expression for classes 2, 3, and 5 (please refer to Table S8). However, the Lung Cancer data set is quite unbalanced in the number of samples between different classes. There are only six samples for class 3, which contributes a little in computing the correlation value of TBM (only four samples are randomly selected during the repeated random splitting procedure). Therefore, irrespective of whether class 3 is highly expressed or unexpressed, the contribution of class 3 in correlation may not be significant (Figs. 2bd). Even for class 2 (Normal) which has 17 samples, the selection by TBM suffers from the same problem (Fig. 2a). Similarly, the level 1-like genes (Figs. 2ef) can also support this effect.

Biological Relevance of some level-2 and level-3 biomarkers

CNS data set.

The first 4 genes in level-2 are highly expressed in human cerebellar tumors and malignant glioma (i.e., Ncer and MGlio; upper group representing 4 and 2) and are practically unexpressed in the rest of the classes (lower group consisting of 1, 3 and 5). Interestingly, six of the top genes in level-2 (PRUNE2, TIMP4, TMOD1, ADORA1, NEUROD1 and C1orf61/Croc4) with GMI score ranging from 2.38 to 1.17 are primarily, if not solely, involved in cytoskeleton maintenance. Perturbation in the expression levels of these genes is likely to affect morphological, structural and functional integrity of the cell. For example, PRUNE2 gene is over-expressed in prostate cancer and down-regulates Rho-A and Rho-C that are involved in actin polymerization and oncogenic transformation [33], [34]. TIMP4 inhibits tumor progression by inhibiting cell matrix degradation by endopeptidase MMP-2 [35]. TMOD1 regulates actin filament dynamics [36]. On the other hand, C1orf61/Croc4 positively controls c-fos activity and the later one up-regulates actin expression [37], [38]. Moreover, NEUROD1, a transcriptional factor, controls transcription of cytoskeletal genes [39]. In cancer, one of the frequently affected pathways is the cytoskeletal structural organization and the genes in this pathway are dys-regulated but not necessarily mutated. For example, cofilin expression is frequently increased in glioblastoma and ovarian cancer [40], [41] whereas cortactin is often over-expressed in breast cancer and squamous carcinoma of head and neck [42], [43]. Literature search revealed that none of the level-2 genes are ever mutated in any cancer but rather work to protect the cytoskeletal integrity of the cell. Over-expression of these genes in human cerebellar tumors and malignant glioma may reflect as an innate attempt of the cell to counteract the process of tumor transformation.

Earlier we proposed GDI system (Gene Dominant/Dormant Index) that selects two specific types of gene, i.e., dominant and dormant genes wherein the former is up-regulated in one specific class and down-regulated in the remaining classes, and the latter is down-regulated in one specific class but up-regulated in the rest of the classes [1]. Although different cancers, according to their tissue of origin, do differ in their mode of action through regulation and/or dys-regulation of various physiological pathways, they do also share some common genes/pathways in the formation of cancers. In that sense, GMI-discovered genes that are up-regulated in more than one-class of cancer (here we can also termed them as co-dominant) may provide additional biological insights of how various cancers might be related and probable molecular pathways involved in them.

In case of the level-3 genes, it appears that they are functionally diverse. However, the first two genes share a common cytoskeletal pathway. For example, FEZ1 (GMI score 3.95), highly expressed in the group {4, 2, 5}, is a brain specific cytoskeletal regulatory protein associated with microtubule and in various tumors it is either deleted and/or mutated [44]. But here in malignant glioma, human cerebellar tumor and primitive neuroectodermal tumor (PNET) the expression of FEZ1 is increased signifying a failed attempt of the cell to contain the tumor growth. Recent study shows that FRG1 (GMI value 2.07, up-regulated in group {1, 4, 3}) a multifunctional protein, specially binds to F-actin and its misregulation leads to facioscapulohumeral muscular dystrophy (FSHD) [45]. HMGN2 with GMI value of 1.11 (up-regulated in group {1, 3, 5}) is a highly conserved nucleosomal protein involved in unfolding higher-order chromatin structure and acts as an impediment for cell migration [46], [47]. VAT1 with GMI value of 1.14 (up-regulated in group {3, 1, 2}) controls the storage and release of neurotransmitters in the nerve terminal [48]. Importantly, none of these top-5 genes in both level-2 (except PRUNE2) and level-3 are ever reported as biomarkers (over-expressed and/or under-expressed) for any type of cancer. In that sense, our analysis demonstrates that these co-dominant genes may be used as unique signature genes in defining the respective tumor groups.

Lung Cancer data set.

Level-2 and level-3 discriminatory genes in the Lung Cancer data set are quite interesting. As in the case for CNS data set, the four top level-2 discriminatory genes with GMI values ranging from 3.45 to 1.58 are primarily involved in regulation of cytoskeleton that controls vesicular trafficking in golgi bodies [49]. These genes are basically co-dominant in classes 5 and 3, representing pulmonary carcinoids (COID) and small cell lung cancer (SCLC), respectively. TAGLN3 associates with and regulates F-actin, α-tubulin, tau and MAP2 [50]. CRMP1 is a phosphoprotein and controls microtubules [51]. NCAM1 is a classical cell adhesion molecule that interacts with a number of cytoskeletal proteins and regulates cell architecture. INSM1 gene encodes a zinc finger DNA-binding domain and a putative prohormone domain. This gene is a sensitive marker for neuroendocrine differentiation of human lung tumors [52]. Interestingly, INSM1 acts as a transcriptional repressor of NEUROD that is involved in regulating cytoskeleton genes [53]. It is important to note that four of the first five genes in level-2 (TGLN3, CRMP1, NCAM1 and SCAMP5) are supposed to be neuron-specific [54]-[57]. But our analysis shows that these neuron-specific genes are aberrantly expressed in COID and SCLC class of lung tumor, most likely as a result of mis-functioning of SWI/SNF complex [58]. The fact that the top 4 level-2 discriminatory genes both in CNS and Lung Cancer data sets are involved in the same molecular pathway might be a pure coincidence or else it could be that pathways of cancer, irrespective of types, are frequently involved dys-regulating genes associated with cytoskeleton architecture.

On the other hand, the top 6 level-3 discriminatory genes are basically involved in calcium signaling; however, they bear quite a low GMI values ranging from 1.25 to 0.34. These genes, except SEC14L1, are primarily co-dominant in the classes 4, 1 and 2, i.e., squamous cell carcinomas (SQ), lung adenocarcinomas (Adeno) and normal lung (Normal). S100A11 and S100A10 are calcium binding proteins and their increased expression are often observed in colorectal and prostate cancer, including non-small cell lung cancer [59], [60]. VAMP8 is involved in calcium-dependent exocytosis process and SEC14L1 is a membrane trafficking protein, and none of them is linked to cancer pathways. On the other hand, LGALS3, an IgE-lectin binding protein, is often abundantly expressed in thyroid cancer, hepatocellular carcinoma and in non-small cell lung cancer [61], [62]. FAM38 is involved in intracellular calcium release [63]. Calcium signaling plays very important role in cancer in mediating angiogenesis steps such as invasion, adhesion and tumor cell migration. In many cancers, calcium signaling genes are up-regulated. The fact that the aforementioned genes are also up-regulated in normal lung (class 2) indicates that intact calcium signaling pathway may be necessary for tumorigenesis process in these cancer subgroups.

In a nutshell GMI method establishes that cytoskeleton regulating genes are aberrantly expressed in pulmonary carcinoids (COID) and small cell lung cancer (SCLC) whereas calcium signaling pathway genes are active in the rest of the classes, i.e., squamous cell carcinomas (SQ), adenocarcinomas (Adeno), and normal lung (Normal).

Typical microarray data analysis finds markers that can differentiate one class of cancer from the others, while GMI can find markers, which exhibit similar expression pattern in a group of cancers and can distinguish one group of cancers from another group of cancers. Thus, the importance of our GMI algorithm is in identifying de novo group specific genes in an automatic manner. When we have samples from multiple cancers/diseases with less knowledge about the relationship between them, GMI could be very useful to discover a group (subset) of cancers, if exists, that can be discriminated from the remaining set of cancers using expression pattern of one or more genes. This in turn can help us to understand the relationship between subsets of cancers/diseases groups via functional analysis of the GMI-identified group specific genes. Further, existence and discovery of such group markers open up the possibility of finding common drug targets for different diseases as well as possibility of using drugs designed for one disease to cure another disease.

KEGG pathway analysis

After analyzing the relationship between a single gene and multiple diseases, we attempt to explore the relationship between the groups of discriminatory genes and diseases. For this, instead of considering level-1 discriminatory genes, which may be identified by all gene selection methods, we focus on the level-2 discriminatory genes first. We try to identify related biological pathways in which the level-2 discriminatory genes are involved by cross-referencing to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [64]-[66].

We take the advantage of the abundant number of genes in the Leukemia and Lung Cancer data sets in the data selection for pathway analysis. For the Leukemia data set, we consider all level-2 genes with p-values smaller or equal to 0.0001. Subsequently, we divide these level-2 genes into three different gene lists based on the composition of the upper group. The numbers of genes, actually probe sets, in each list were: 712 for the upper group composed of acute lymphoblastic leukemia (ALL) and mixed-lineage Leukemia (MLL), which we call ALL-MLL list; 3 for the upper group composed of ALL and acute myelogenous leukemia (AML), which is termed as ALL-AML list; 100 for the upper group composed of MLL and AML (MLL-AML list). Next, we use the functional annotation tool provided by the database for annotation, visualization and integrated discovery (DAVID ver. 6.7) [67], [68] to cross-reference to the KEGG database (accepted default parameters).

For the ALL-MLL list, 709 probe sets that belong to Homo sapiens, are selected and the default background is used for the analysis. A total of 665 DAVID IDs are converted and 228 probe sets are involved in the KEGG pathway category. The output results provided by DAVID are summarized in Table 3. The identified pathways/annotation terms are tabulated in the first column (Term). The total number of genes (Count) and the corresponding percentage (%) in our gene list involved in each pathway along with the modified Fisher Exact p-value (p-value), and other enrichment quantitative measurements (Fold Enrichment, Bonferroni, Benjamini, FDR) are also included in Table 3. As shown in Table 3, there are four pathways with EASE Score, the modified Fisher Exact p-value, smaller than 0.01. These are spliceosome, B-cell receptor signaling, basal transcription factors and inositol phosphate metabolism pathways. Literature search revealed that these pathways are often impaired in ALL and MLL development [69][73]. On the other hand, if we consider EASE Score smaller than 0.05, then 15 pathways are identified. The full table with probe set lists is provided in Table S9. Most of these additional pathways are involved in cancer as evident by their name such as colorectal cancer pathways, base excision repair, mismatch repair, nucleotide excision repair, and more specifically involvement of chronic myeloid leukemia pathway is noteworthy. Interestingly, pathways like fatty acid metabolism, phosphatidylionositol signaling and ubiquitin mediated proteolysis that at first sight seemed to be remotely associated with ALL or MLL but are indeed found to be very much molecularly involved in the pathogenesis of ALL and MLL as revealed by literature search [74][77].

thumbnail
Table 3. Summarization of the identified pathways related to the level-2 discriminatory genes in the Leukemia data set.

https://doi.org/10.1371/journal.pone.0024259.t003

Using the same procedure, 98 probe sets belonging to Homo sapiens and one probe set belonging to Mus musculus are selected for the MLL-AML list. These probe sets are mapped to 89 DAVID IDs and only 33 probe sets are involved in the KEGG pathway category. Befitting our method of analysis, the only pathway identified by the functional annotation tool with EASE Score smaller than 0.01 is the Acute Myeloid Leukemia pathway (KEGG entry ID: hsa05221). This attests that our method of analysis may be relevant in identifying the gene(s) signatures for distinguishing subtypes of cancer. Additional three pathways, B cell signaling, Chemokine signaling and Fc epsilom R1 signaling, which have passed the EASE Score criterion (p-value < 0.05) are also shown in the Table 3. These are all well documented pathways that are impaired in AML and/or MLL cancer [78]-[80]. Moreover, MLL is a special ALL group, which carries MLL gene translocation. The authors of the Leukemia data set [22] collected ALL and MLL samples from the individuals diagnosed as CD19+ B-precursor ALL without and with MLL translocation, respectively. Those well-known B cell marker genes, e.g., CD19, CD81, CD79A, CD79B, are all contained in the ALL-MLL list. Some other genes, which have been discussed by the authors [22] e.g., IL7R, DNTT, TCF3, POU2AF1 and SMARCA4, are also listed in the ALL-MLL list. The rest of genes in the ALL-MLL list are related to B cell proliferation, phosphorylation, DNA replication, tumor development. On the other hand, the myeloid-specific genes e.g., CCNA1, SERPINB1, RNASE3, and some other genes that are discussed by the authors who published the data set e.g., CD44, HOXA9, HOXA5, SPN, LGALS1, ANXA1, ANXA2, are contained in the MLL-AML list. The only three genes exhibiting common up-regulated pattern in ALL and AML are RYK, SCHIP1, and YESP1. The long list of level-2 discriminatory genes for the ALL and MLL group suggests that between the three classes, the ALL and MLL group share more similar physiological properties. MLL represents mixed-lineage features yet we can find some similarities between MLL and AML. However, ALL and AML are distinguished from each other. In a nutshell, many of the level-2 discriminatory genes identified in our analysis for leukemia class-specific signatures are directly involved in pathways that lead to the development of leukemia. We believe that GMI would be helpful in understanding the relationship between unknown classes.

Furthermore, we find that the B cell receptor signaling pathway was identified by both ALL-MLL list and MLL-AML list from Table 3. To further investigate this pathway, we downloaded the figure of this pathway from KEGG [64][66] and labeled the level-2 genes using the same principle as used by DAVID [67], [68]. The modified figure is shown in Fig. S4. In Figure S4, those genes, which are relatively up-regulated in the ALL-MLL group, are labeled with red stars. On the other hand, those genes, which are up-regulated in MLL-AML group, are labeled with blue stars. We may interpret those red star genes as lymphoblastic genes because of the relatively up-regulated expression in both ALL and MLL but down-regulated expression in AML. Subsequently, those blue star genes can be treated as myelogenous genes because of their relatively lower expression in ALL. As summarized by KEGG [64][66], the activation of this pathway will involve in B cell proliferation, differentiation and Ig production as well as other processes. Thus, we find that many genes involved in this pathway are labeled with red stars (highly expressed in the CD19+ B-precursor ALL and MLL). We may interpret that the activation of CD79A (Igα) and CD79B (Igβ) with the co-simulators CD81 and CD19 triggers the activation of B cell signaling pathway in ALL and MLL. Furthermore, the co-inhibitor LIRB3 (PIR-B) is also relatively down-regulated in ALL which could be treated as a positive factor for activating the B cell signaling pathway. Notably, the relatively lower expression of Rac in ALL may play an important role of lymphoblastic leukemia. In murine study, it has been demonstrated that Rac genes are important for appropriate positioning of hematopoietic stem cells (HSCs) within the bone marrow microenvironment. The deletion of both Rac1 and Rac2 murine alleles would lead to a massive egress of HSCs into the blood from the marrow [81], [82]. Thus, we may understand the role of B cell signaling pathway between those different groups of leukemia. This is an example to demonstrate how multiple level-2 (in general level-k) genes can be useful to interpret the observations and to understand the mechanisms behind. We believe that GMI algorithm will be a useful algorithm to the community.

In the Lung Cancer data set, we select all probe sets with p-value smaller than or equal to 0.0001 and group these probe sets into different gene lists for different compositions of the upper group. In the results shown in Table 4, there are five combinations in which we can identify related pathways (EASE Score < 0.05). These five combinations are Adeno-Normal, Adeno-SQ, Normal-SCLC, SCLC-SQ and Normal-COID. Each of these combinations belongs to the upper group. These pathways associated with SCLC-COID list are independently tabulated in Table 5. The full tables with probe set lists are also provided in Table S10 and Table S11. Out of these five combinations two combinations, Adeno-SQ and SCLC-SQ are important in order to molecularly distinguish four classes of lung cancer. In SCLC-COID group out of 27 pathways identified, 10 pathways are directly involved in cancer as their name suggests and additional pathways like splicesome, phosphatidylinositol signaling, calcium signaling and cell cycle pathways are also participatory in cancer process as discussed earlier. Importantly, the pathways like wnt, ErbB, MAPK, autophagy and Jak-Stat signaling are all well established in cancer development process. However, it is presently unclear as why Alzheimer's, Type II Diabetes, Long-term potentiation and Neurotrophin signaling pathways genes are up-regulated in SCLC-COID groups since their roles in cancer are not well founded. In the Adeno-SQ list, genes belonging to pathways of Ribosome, ECM receptor interaction and Focal adhesion were found to be up-regulated. Literature search revealed that all of these pathways are more or less compromised in Adeno-SQ [83]-[86].

thumbnail
Table 4. Summarization of the identified pathways related to the level-2 discriminatory genes in the Lung Cancer data set (Part I).

https://doi.org/10.1371/journal.pone.0024259.t004

thumbnail
Table 5. Summarization of the identified pathways related to the level-2 discriminatory genes in the Lung Cancer data set (Part II).

https://doi.org/10.1371/journal.pone.0024259.t005

To summarize, KEGG pathway analysis establishes that there is as such no specific pathway(s) that can exclusively determine the subgroups of cancer. A plethora of pathways, shared and/or non-shared, is activated in various cancer subgroups. The level-n discriminatory genes found by GMI are important genes that play a major role in multiple cancer pathways.

Discussion

In this study, we emphasize the important role played by multiple-class specific marker genes. We found that most available gene selection algorithms focus on or tend to identify single-class specific signature genes as marker genes. But, multiple-class specific markers may play important roles in biology besides construction of computational prediction systems. The lack of intuitive, easy-to-use methodologies for biologists inspired us to propose a novel gene selection algorithm based on an index called, Group Marker Index (GMI), which is efficient in the identification of both single- and multiple-class specific signature genes from microarray data. For the sake of comparison, a template-based method (TBM) is also formulated and used along with GMI. GMI differs from template-based methods in two aspects. First, to evaluate each gene, GMI uses the ordering of mean gene expression values for all classes that enables it to identify de novo certain combination of classes for each level of discrimination. It does not need to check all combinations of classes and hence it reduces drastically the computation overhead. We have demonstrated that GMI can identify robust and statistically significant marker genes for each level of discrimination using a repeated random-splitting procedure (Table 1 and Tables S1, S2, S3). Second, no prior knowledge is required for template assignment. It is possible for GMI to infer novel relationships between the studied classes/diseases. We have discussed the relevance/biological roles played by several level 2/3 genes. In nutshell, we observed selective dys-regulation of cytoskeleton regulating gene-network primarily at level-2 in both CNS and lung cancer. Secondly, we found that neuron-specific cytoskeleton genes are aberrantly expressed in COID and SCLC tumor of lung. Third, calcium signaling pathway genes are upregulated at level-3 of lung tumor. This method hitherto uncovered the importance of cytoskeleton genes and their use as class-specific markers for cancer diagnosis. In addition, we have also mapped a group of level 2/3 genes to the KEGG pathways. Both results exhibited concordant findings and implied potentially common properties between different classes or cancers. It suggests the possibility of identifying common drug targets between different diseases. It also opens up the possibility of using a specific remedy designed for one disease to cure another disease.

Second, GMI uses a non-parametric indicator, between-class-transition (BCT), to evaluate the discrimination between classes rather than evaluating the similarity between a template and gene expression values. An advantage of exploiting discrimination ability rather than similarity is that it ensures that the selected genes will be able to discriminate between subsets of classes at least to a reasonable extent. Inspired by the work in [19], [20] where authors used Analysis of Variance (ANOVA) along with the template-matching step for their purposes, here we have performed the ANOVA test between gene expression values and class labels for each gene. We find that a gene with a low ANOVA p-value often does not exhibit good discrimination ability between classes. Our experience shows that most genes in a multiple-class microarray data set are evaluated with small p-values by ANOVA test (See Fig. S5). Thus, for such a problem the main criterion used for gene selection is essentially the correlation value. Moreover, since TBM relies on correlation, it suffers from an additional problem (as already explained) while applying on the microarray data with unbalanced sample sizes over different classes. We have designed an example with unbalanced samples (Fig. 3) to demonstrate that the class with large sample size biases the resultant correlation value and the class with small sample size may not contribute much to the correlation. This is an inherent limitation of template based methods.

Materials and Methods

Data sets

SRBCT data set.

It is a cDNA microarray data set with 63 samples from 4 classes of childhood small round blue cell tumors (SRBCT): 23 Ewing sarcomas (EWS), 8 Burkitt lymphomas (BL), 12 neuroblastomas (NB), and 20 rhabdomyosarcomas (RMS). Each sample is represented by 2308 genes. This data set is available at http://research.nhgri.nih.gov/microarray/Supplement/.

Leukemia data set.

This Affymetrix high-density oligonucleotide array data set has 57 samples from 3 classes of leukemia: 20 acute lymphoblastic leukemia (ALL), 17 mixed-lineage leukemia (MLL), 20 acute myelogenous leukemia (AML), each with 12582 genes. This data set is available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi.

CNS data set.

CNS is also an Affymetrix high-density oligonucleotide microarray data set containing 42 samples distributed over 5 different types of tumors of the central nervous system (CNS): 10 medulloblastomas (MD), 10 malignant gliomas (MGlio), 10 atypical teratoid/rhabdoid tumors (Rhab), 8 primitive neuro-ectodermal tumors (PNET), and 4 human cerebella tumors (Ncer). For this data set each sample is represented by 7129 genes. This data set is available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi.

Lung Cancer data set.

In this Affymetrix high-density oligonucleotide array, we have 203 samples in 12600 dimensions. There are 5 classes: 139 lung adenocarcinomas (Adeno), 21 squamous cell lung carcinomas (SQ), 20 pulmonary carcinoids (COID), 6 small-cell lung cancer (SCLC), and 17 normal lung specimens (Normal). This data set can be obtained from http://www.pnas.org/content/suppl/2001/11/13/191502998.DC1/DatasetA_12600gene.xls.

Preprocessing

For the Leukemia and CNS data sets, in the preprocessing step the gene expression values less than 100 are raised to 100 and gene expression values greater than 16000 are set to 16000. All gene expression values are then subjected to a base 10 logarithmic transformation. After that, the distribution of gene expression values in each sample is adjusted to zero mean and unit variance. For the SRBCT data set, we do not make any change to the gene expression values as that had already been preprocessed in the original data source [21]. For the Lung Cancer data set, we use the same preprocessed data as used in the original paper [24] without doing any additional preprocessing. For these four data sets, we adopt the same data preprocessing protocols as used in a previous study [1], [15]. In this study, the analysis is conducted using the R environment [87].

Group Marker Index (GMI)

Our main objective is to develop a gene evaluation index, which we call “Group Marker Index (GMI)”, to select biologically significant genes for both two- and multiple-class discrimination problems. GMI builds on the Gene Dominant/Dormant Index (GDI) which was proposed in our earlier study. GDI is a gene evaluation index to select two specific types of genes, i.e., dominant and dormant genes defined in our previous study [1]. In a multiple-class microarray data set, a dominant gene has high gene expression values only in the samples from one specific class and low gene expression values in the samples of the remaining classes. Contrary to the dominant gene, a dormant gene has low gene expression values in only one specific class and high gene expression values in the rest of the classes. GMI is a further refinement of GDI in two aspects: First, irrespective of the number of classes, GDI will only evaluate two levels of discrimination (i.e., dominant and dormant conditions). GMI, on the other hand, will evaluate K-1 levels of discrimination where there are K classes in a pooled microarray experiment. Second, GDI uses the same equation of signal-to-noise ratio (SNR) to evaluate genes, which makes GDI sensitive to outliers, if any. We have avoided these deficiencies by utilizing a simple yet novel intuitive concept, referred to as Between-Class-Transition (BCT), as a further refinement to help identify overlapping classes.

Between-Class-Transition (BCT)

Let be the gene expression value for the gth gene in the sth sample, g = 1, 2, …, G; s = 1, 2, …, S. The sth sample is associated with a class label . For each gene g, g = 1, 2, …, G, we sort the gene expression values of the S samples in descending order. In the sorted sequence, each gene expression value is associated with the class label of the corresponding sample. In this sorted sequence, if the class labels of two successive samples (gene expression values) are different, we count it as a between-class-transition (BCT). In this way, for each gene, we find the total number of BCTs.

Illustration of BCT

We now illustrate BCT using a simple synthetic data set. Consider the gene expression values of a gene for 10 samples divided into two classes (first five samples belong to class 1 and the rest belong to class 2) as depicted in the left panel of Fig. S6. The sorted gene values are depicted in the right panel of Fig. S6. Here we have projected the samples on a vertical line and represented the two classes using two different symbols. In the right panel, the fourth sample from class 1 is sandwiched between two samples from class 2 and this adds two to the BCT count. In this way, the total number of BCTs is 5. An ideal marker gene in a two-class problem will have no outlier sample in either class, and the discrimination between the two classes will be very easy. In such a case the BCT value is only one (for the transition from one class to the other). In all other cases, where there are some outlier samples with gene expression values that overlap with the gene expression values of the samples from the other class, the BCT value will be larger than one. Note that, there could be BCTs even when there is no outlier in the true sense of the word, but the classes have overlap. With an increase in the number of overlapped samples, the BCT value will increase. Therefore, we can use the number of BCTs to help identify overlapped samples, and thus improve marker gene identification.

Consider a two-class problem, as an example. Suppose for a gene g, the samples from each class form compact clusters and the clusters are well separated except for the gene expression value of just one sample in one of the two classes, which is mixed with samples from the other class. In this case, the effect of the outlier on the standard deviation will depend on its location; in other words, how far the outlier is from the mean of the gene expression values among samples in that class. Thus, the effect of just one outlier (mixed sample) on the standard deviation could be moderate to severe. However, this is not the case with BCTs because the number of BCTs will not depend on the location of the outliers in the other class.

Computation of GMI

For easy understanding, Fig. 4 depicts the steps involved in the computation of GMI, which are explained next.

thumbnail
Figure 4. Steps involved to compute GMI and to find the list of group specific genes for each level of discrimination.

https://doi.org/10.1371/journal.pone.0024259.g004

Normalization.

The gene expression values of each gene are normalized in the range from 0 to 1 across samples. This step preserves the richness in the original gene expression values for each gene among the samples and helps us to easily visualize the distribution of gene expression values for the discriminatory genes.

Computation of the mean value of each class for each gene.

For each gene, the mean of the gene expression values in each class is calculated. Let the mean for gene g in class k be μgk.

Sorting of the mean values and defining of K-1 levels of discrimination.

For notational simplicity, to explain the computation of the GMI for gene g, we ignore the index g. We sort μk, k = 1, 2, …, K (K  =  number of classes) in descending order. Let the sorted mean values be μk(s); k = 1, 2, …, K. Suppose is the mean for class C1(s). This means that the gene under consideration is the most highly expressed in class C1(s). We shall call a gene level-n discriminatory gene, if it can discriminate between two groups of classes where one group has n classes in it and the other group has the remaining (K-n) classes. Moreover, for a level-n discriminatory gene, it is highly expressed in the group with n classes while in the remaining K-n classes, this gene is poorly expressed or unexpressed. Note that, if there are K classes, we can define (K-1) levels of discrimination.

Computation of GMI for each level of discrimination.

For a good marker gene with clear level-n discrimination, there are n classes that have high gene expression values and all other classes have low gene expression values. Therefore, we define the set of classes with high gene expression values as “upper group” and the set of remaining classes as “lower group”. For a good marker gene with level-n discrimination the upper and lower groups should be well separated and the overlap between the groups should be low. For level-n discrimination for gene g, first we focus on the separation between upper and lower groups. We use the difference between the mean values of classes Cn(s) and Cn+1(s), i.e., μSep  =  μn(s) – μn+1(s). This is the closest separation between a class in the upper group and a class in the lower group. We take this as the separation between the two groups of classes. Note that, we are not using the group mean. Next we define a measure of overlap. For this we find the number of BCTs between these two groups. To consider the effect of sample size, we use a weight parameter, NS, to normalize the BCT value. Suppose Nup and Nlow represent the total number of samples in upper and lower groups and let NS  =  Min{Nup, Nlow}. Then a measure of overlap between the two groups can be defined as . The GMI value for level-n discrimination for gene g is then defined as the ratio of μSep and OVL :(1)A high value of GMI will indicate good separation with low overlap between the two groups.

Finding a list of group specific genes for each level of discrimination.

After calculating the GMI values of K-1 levels of discrimination for all genes, a list of group specific genes for each level can be obtained as follows. Every gene has K-1 GMI values. Clearly, genes with higher GMIn values have better level-n discriminating power between upper and lower groups. Now for a given discrimination level-n, we sort all genes in descending order of GMIn values. The sorted values may be denoted as GMIn(s). A smaller rank indicates a larger GMIn value and hence a better level-n discriminating power of the gene. From the top of the list of sorted level-n genes, we can select a set of genes that is likely to be biologically interesting and is expected to be useful for level-n discrimination. Note that, the upper and lower groups associated with different level-n genes could be completely different. At this point, we would like to emphasize that designing of classifier or diagnostic prediction system is not the primary objective of this study. The main objective is to find genes that can discriminate between two sets of diseases, not necessarily one disease verses other. We also emphasize that the proposed scheme neither uses pooled mean of a group of classes nor uses the pooled standard deviation of a set of classes and hence is free from the problems of SNR type indexes [1].

Illustration of GMI Computation

To understand the computation of the GMI value at each level of discrimination for a given gene, we have generated a synthetic example, which is composed of five classes of samples. As shown in Fig. 5, there are five classes of samples and there are ten samples in each class. The samples from different classes are labeled with different colors. Since there are K = 5 classes, we evaluate the gene for 4 possible levels of discrimination. In Fig. 5, panel a to panel d, represent, respectively, the computation of level-1 to level-4 discrimination of this gene. In panel a, the level-1 discrimination is evaluated between the third class and the fourth class because they are ranked first and second based on the descending order of the mean of the gene expression values in the five classes. These two highest ranked means are used to calculate μSep and this separation is indicated by two horizontal dashed lines in green and blue. For an easy computation of BCTs, in all four panels, the upper and lower groups are represented using filled in and empty symbols, respectively. Similarly, for panels b to d. For example, in panel b, for level-2 discrimination, μSep is computed using the mean values of class four and class five. From these figures, it is clear that this synthetic gene is a good level-2 gene because from panel b we find that two groups are well separated and the BCT is just 1. This is also revealed by the GMI values of 0.150, 9.537, 0.287, and 1.925 for level-1 to level-4, respectively.

thumbnail
Figure 5. A 5-class synthetic example to illustrate computation of GMI.

There are four levels of discrimination in the 5-class synthetic data set. Panels (a) to (d) depict the computation of GMI values at each level of discrimination. The dotted lines in each panel indicate the two mean values used for GMI computation in each level of discrimination. All filled samples in each panel indicate the upper group samples. The remaining open samples in each panel indicate the lower group samples.

https://doi.org/10.1371/journal.pone.0024259.g005

Gene selection and evaluation of statistical significance

To evaluate the statistical significance of group specific genes identified by GMI, we perform a permutation procedure to obtain the corresponding p- and q-values. This procedure is similar to the method used in our previous study [1]. The necessary steps are summarized below. Let G be the total number of genes and S be the total number of samples.

Step 1. Gene selection.

Step 1.1 Repeated random splitting. Given a microarray data set D with K classes (xgs is the gene expression value of gene g in sample s; 1≤gG, 1≤sS) and with class labels (γs, 1≤sS), we randomly select 2/3rd samples from each class as the training set TR(r) (r denotes the rth random selection of samples, 1≤rR).

Step 1.2 Computation of GMI and preliminary gene selection. For each training set TR(r), we compute the GMI values for different levels of discrimination GMIng(r) (1≤nK-1) for each gene g. Simultaneously, we use an independent indicator Fng(r) to denote whether the gene g is included in the list of top N1 genes (N1 genes ranked in descending order of GMIng(r) values); Fng(r) = 1 represents “true” and Fng(r) = 0 represents “false”.

Step 1.3 Gene selection. After R = 100 times of the random selection of samples, we average the GMI values of different levels of discrimination, GMIng(r), 1≤rR, for each gene g as GMIng(ave) and sum up the Fng(r) values for each gene g as Fng(sum). Note that, Fng(sum) is the number of times (i.e., frequency) the gene g is selected as one of the top N1 genes in R experiments. For each level of discrimination, we select the top N2 genes with the highest frequencies, Fng(sum). In this study, we set both N1 and N2 = 10.

Step 2. Permutation.

We randomly permute the class labels γs for B times. In the bth permutation (1≤bB), we randomly select 2/3rd samples from each class as the training set TR(r)(b) for R = 100 times. For each training set TR(r)(b), we compute GMIng(r) (b) for different levels of discrimination for gene g using the permuted class labels γs(b). Next, we average these new GMI values of different levels of discrimination GMIng(r)(b) for each gene g as GMIng(ave)(b). These GMIng(ave)(b) are used for the calculation of p- and q-values.

Step 3. Calculation of p-values.

The p-value of the observed averaged GMI value, GMIng(ave), for a particular level of discrimination, of a gene g is(2)Where I(.) is an indicator function that takes the value one when true, and zero otherwise.

Step 4. Calculation of q-values.

To account for the multiple tests being performed in the G genes, the q-value of the observed averaged GMIng(ave) is calculated as(3)In this study, we have performed this permutation test with B = 200 for all data sets.

Comparison with other method

In order to demonstrate the effectiveness of our GMI method in identifying group specific genes, we propose a scheme based on the template-based method (TBM), which is similar to the method used in a previous work [19] for comparison. In this study, we used Pearson's correlation coefficient to evaluate the relation between gene expression values and pre-assigned templates. The detailed steps of gene selection in the proposed TBM are described below:

TBM Step 1: Repeated random splitting.

For a fair comparison with GMI, the repeated random splitting scheme is also used here. For a given microarray data set D with K classes (xgs is the gene expression value of gene g in sample s; 1≤gG, 1≤sS) with class labels (γs, 1≤sS), we randomly select 2/3rd samples from each class as the training set TR(r) (as earlier, r denotes the rth random selection of samples, 1≤rR).

TBM Step 2: Normalization.

For each training set TR(r), the gene expression values of every gene are normalized across samples to [0, 1].

TBM Step 3: Identification of group specific genes for level-n discrimination.

For every gene in the training set TR(r), K-1 levels of discrimination are defined as in the previous section. To find group specific marker genes for level-n discrimination, the following steps are followed:

TBM Step 3.1 Creation of template T0(r).We create a template T0(r){0, 0, 0, … , 0}; the length of T0(r) is equal to the number of samples in training set TR(r). Initially, every value in T0(r) is set to zero.

TBM Step 3.2 Creation of template Tn(m). For a given level-n, there could be many combinations (subsets) of classes for the upper group. We denote the appropriate template for the mth combination (for level-n discrimination) as Tn(m)(r) and generate it from T0(r) as follows. If the sth sample of gene g belongs to the upper group (i.e., in the mth combination of classes) then the sth value of T0(r) is set to 1. This modified T0(r) is the template Tn(m)(r).

TBM Step 3.3 Computation of Pearson's correlation coefficient. Note that, for level-n, mth combination, and sample set TR(r), the Tn(m)(r) is fixed. Now we compute the Pearson's correlation coefficient for every gene g, with the template Tn(m)(r) and denote it as Pn(m)(r)(g). For level-n we have combinations. Let . Now we sort the G correlation values in descending order and create the list of top N1 genes (along with the associated upper groups) for level-n as Ln(r). Thus, a typical element of the list will have two components: the upper group (E) and the gene (g).

TBM Step 4: Gene selection.

Steps 1 through 3 are repeated for r = 1 to R times (here R = 100) resulting in R lists of top N1 genes, Ln(r); r = 1, …, R. Let Fn(g, m) be the number of times (frequency) with which the gene g, associated with upper group m, appears in the R lists. This results in a list where every gene has associated with it just one frequency and one upper group. Now, we sort this list in descending order based on the frequencies and select the top N2 genes from that list. Note that, each of these N2 genes will have an associated upper group. Here N1 = N2 = 10 as used with GMI.

Supporting Information

Figure S1.

Scatter-plots of the top most gene of each level in the Leukemia data set.

https://doi.org/10.1371/journal.pone.0024259.s001

(PDF)

Figure S2.

Scatter-plots of the top most gene of each level in the CNS data set.

https://doi.org/10.1371/journal.pone.0024259.s002

(PDF)

Figure S3.

Scatter-plots of the top most gene of each level in the Lung Cancer data set.

https://doi.org/10.1371/journal.pone.0024259.s003

(PDF)

Figure S4.

The modified figure of B cell receptor signaling pathway.

https://doi.org/10.1371/journal.pone.0024259.s004

(PDF)

Figure S5.

The distributions of ANOVA p-values for four data sets.

https://doi.org/10.1371/journal.pone.0024259.s005

(PDF)

Figure S6.

Illustration of between-class-transition (BCT).

https://doi.org/10.1371/journal.pone.0024259.s006

(PDF)

Table S1.

Summary of top 10 genes of each level selected by GMI in the Leukemia data set.

https://doi.org/10.1371/journal.pone.0024259.s007

(PDF)

Table S2.

Summary of top 10 genes of each level selected by GMI in the CNS data set.

https://doi.org/10.1371/journal.pone.0024259.s008

(PDF)

Table S3.

Summary of top 10 genes of each level selected by GMI in the Lung Cancer data set.

https://doi.org/10.1371/journal.pone.0024259.s009

(PDF)

Table S4.

The comparison of top 10 level-2 genes selected by GMI and TBM in the CNS data set.

https://doi.org/10.1371/journal.pone.0024259.s010

(PDF)

Table S5.

The comparison of top 10 level-3 genes selected by GMI and TBM in the CNS data set.

https://doi.org/10.1371/journal.pone.0024259.s011

(PDF)

Table S6.

The comparison of top 10 level-2 genes selected by GMI and TBM in the Leukemia data set.

https://doi.org/10.1371/journal.pone.0024259.s012

(PDF)

Table S7.

The comparison of top 10 level-2 genes selected by GMI and TBM in the Lung Cancer data set.

https://doi.org/10.1371/journal.pone.0024259.s013

(PDF)

Table S8.

The comparison of top 10 level-3 genes selected by GMI and TBM in the Lung Cancer data set.

https://doi.org/10.1371/journal.pone.0024259.s014

(PDF)

Table S9.

The full summary table of the identified pathways related to the level 2 discriminatory genes in the Leukemia data set.

https://doi.org/10.1371/journal.pone.0024259.s015

(PDF)

Table S10.

The full summary table of the identified pathways related to the level 2 discriminatory genes in the Lung cancer data set (Part I).

https://doi.org/10.1371/journal.pone.0024259.s016

(PDF)

Table S11.

The full summary table of the identified pathways related to the level 2 discriminatory genes in the Lung cancer data set (Part II).

https://doi.org/10.1371/journal.pone.0024259.s017

(PDF)

File S1.

The level-2 genes selected only by GMI and the level-2 genes selected only by TBM in the Lung Cancer data set.

https://doi.org/10.1371/journal.pone.0024259.s018

(PDF)

Author Contributions

Conceived and designed the experiments: Y-ST NRP I-FC. Performed the experiments: Y-ST. Analyzed the data: Y-ST. Wrote the paper: Y-ST KA NRP I-FC. Biological relevance interpretation: KA Y-ST. Led and coordinated the investigation: NRP I-FC.

References

  1. 1. Tsai YS, Lin CT, Tseng GC, Chung IF, Pal NR (2008) Discovery of dominant and dormant genes from expression data using a novel generalization of SNR for multi-class problems. BMC Bioinformatics 9: 425.
  2. 2. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531–537.
  3. 3. Dudoit S, Fridlyand J, Speed TP (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc 97: 77–87.
  4. 4. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46: 389–422.
  5. 5. Pal NR, Aguan K, Sharma A, Amari SI (2007) Discovering biomarkers from gene expression data for predicting cancer subgroups using neural networks and relational fuzzy clustering. BMC Bioinformatics 8: 5.
  6. 6. Saeys Y, Inza I, Larrañaga P (2007) A review of feature selection techniques in bioinformatics. Bioinformatics 23: 2507–2517.
  7. 7. Tusher VG, Tibshirani R, Chu G (2001) Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci USA 98: 5116–5121.
  8. 8. Opgen-Rhein R, Strimmer K (2007) Accurate ranking of differentially expressed genes by a distribution-free shrinkage approach. Stat Appl Genet Mol Biol 6: 9.
  9. 9. Zuber V, Strimmer K (2009) Gene ranking and biomarker discovery under correlation. Bioinformatics 25: 2700–2707.
  10. 10. Liu XX, Krishnan A, Mondry A (2005) An Entropy-based gene selection method for cancer classification using microarray data. BMC Bioinformatics 6: 76.
  11. 11. Huerta EB, Duval B, Hao JK (2008) Fuzzy logic for elimination of redundant information of microarray data. Genomics Proteomics Bioinformatics 6(2): 61–73.
  12. 12. Li L, Weinberg CR, Darden TA, Pedersen LG (2001) Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 17: 1131–1142.
  13. 13. Ooi CH, Tan P (2003) Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics 19: 37–44.
  14. 14. Jirapech-Umpai T, Aitken S (2005) Feature selection and classification for microarray data analysis: evolutionary methods for identifying predictive genes. BMC Bioinformatics 6: 148.
  15. 15. Niijima S, Kuhara S (2006) Recursive gene selection based on maximum margin criterion: a comparison with SVM-RFE. BMC Bioinformatics 7: 543.
  16. 16. Zhou X, Tuck DP (2007) MSVM-RFE: extensions of SVM-RFE for multiclass gene selection on DNA microarray data. Bioinformatics 23: 1106–1114.
  17. 17. Yousef M, Jung S, Showe LC, Showe MK (2007) Recursive Cluster Elimination (RCE) for classification and feature selection from gene expression data. BMC Bioinformatics 8: 144.
  18. 18. Yousef M, Ketany M, Manevitz L, Showe LC, Showe MK (2009) Classification and biomarker identification using gene network modules and support vector machines. BMC Bioinformatics 10: 337.
  19. 19. Pavlidis P, Noble WS (2001) Analysis of strain and regional variation in gene expression in mouse brain. Genome Biol 2: 42.
  20. 20. Hulshizer R, Blalock EM (2007) Post hoc pattern matching: assigning significance to statistically defined expression patterns in single channel microarray data. BMC Bioinformatics 8: 240.
  21. 21. Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, et al. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 7: 673–679.
  22. 22. Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, et al. (2002) MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet 30: 41–47.
  23. 23. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, et al. (2002) Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415: 436–442.
  24. 24. Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, et al. (2001) Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA 98: 13790–13795.
  25. 25. Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, et al. (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 7: 673–679.
  26. 26. Baird K, Davis S, Antonescu CR, Harper UL, Walker RL, et al. (2005) Gene expression profiling of human sarcomas: insights into sarcoma biology. Cancer Res 65: 9226–9235.
  27. 27. Yu Y, Khan J, Khanna C, Helman L, Meltzer PS, et al. (2004) Expression profiling identifies the cytoskeletal organizer ezrin and the developmental homeoprotein Six-1 as key metastatic regulators. Nat Med 10: 175–181.
  28. 28. Davicioni E, Finckenstein FG, Shahbazian V, Buckley JD, Triche TJ, et al. (2006) Identification of a PAX-FKHR gene expression signature that defines molecular classes and determines the prognosis of alveolar rhabdomyosarcomas. Cancer Res 66: 6936–6946.
  29. 29. Taylor JG VI, Cheuk AT, Tsang PS, Chung JY, Song YK, et al. (2009) Identification of FGFR4-activating mutations in human rhabdomyosarcomas that promote metastasis in xenotransplanted models. J Clin Invest 119: 3395–3407.
  30. 30. Cerisano V, Aalto Y, Perdichizzi S, Bernard G, Manara MC, et al. (2004) Molecular mechanisms of CD99-induced caspase-independent cell death and cell-cell adhesion in Ewing's sarcoma cells: actin and zyxin as key intracellular mediators. Oncogene 23: 5664–5674.
  31. 31. Rocchi A, Manara MC, Sciandra M, Zambelli D, Nardi F, et al. (2010) CD99 inhibits neural differentiation of human Ewing sarcoma cells and thereby contributes to oncogenesis. J Clin Invest 120: 668–680.
  32. 32. Scotlandi K, Baldini N, Cerisano V, Manara MC, Benini S, et al. (2000) CD99 engagement: an effective therapeutic strategy for Ewing tumors. Cancer Res 60: 5134–5142.
  33. 33. Salagierski M, Verhaegh GW, Jannink SA, Smit FP, Hessels D, et al. (2009) Differential expression of PCA3 and its overlapping PRUNE2 transcript in prostate cancer. Prostrate 70: 70–78.
  34. 34. Maekawa M, Ishizaki T, Boku S, Watanabe N, Fujita A, et al. (1999) Signaling from Rho to the actin cytoskeleton through protein kinases ROCK and LIM-kinase. Science 285: 895–958.
  35. 35. Wang M, Liu YE, Greene J, Sheng S, Fuchs A, et al. (1997) Inhibition of tumor growth and metastasis of human breast cancer cells transfected with tissue inhibitor of metalloproteinase 4. Oncogene 14: 2767–2774.
  36. 36. Kostyukova AS, Rapp BA, Choy A, Greenfield NJ, Hitchcock-DeGregori SE (2005) Structural requirements of tropomodulin for tropomyosin binding and actin filament capping. Biochemistry. 44. : 4905–4910. doi:10.1021/bi047468.
  37. 37. Dilley WG, Kalyanaraman S, Verma S, Cobb JP, Laramie JM, et al. (2005) Global gene expression in neuroendocrine tumors from patients with the MEN1 syndrome. Mol Cancer 4: 9.
  38. 38. Bishopric NH, Jayasena V, Webster KA (1992) Positive regulation of the skeletal alpha-actin gene by Fos and Jun in cardiac myocytes. J Biol Chem 267: 25535–25540.
  39. 39. Seo S, Lim JW, Yellajoshyula D, Chang LW, Kroll KL (2007) Neurogenin and NeuroD direct transcriptional targets and their regulatory enhancers. EMBO J 26: 5093–5108.
  40. 40. Martoglio AM, Tom BD, Starkey M, Corps AN, Charnock-Jones DS, et al. (2000) Changes in tumorigenesis- and angiogenesis-related gene transcript abundance profiles in ovarian cancer detected by tailored high density cDNA arrays. Mol Med 6: 750–765.
  41. 41. Yap CT, Simpson TI, Pratt T, Price DJ, Maciver SK (2005) The motility of glioblastoma tumour cells is modulated by intracellular cofilin expression in a concentration-dependent manner. Cell Motil Cytoskeleton 60: 153–165.
  42. 42. Campbell DH, deFazio A, Sutherland RL, Daly RJ (1996) Expression and tyrosine phosphorylation of EMS1 in human breast cancer cell lines. Int J Cancer 68: 485–92.
  43. 43. Patel AM, Incognito LS, Schechter GL, Wasilenko WJ, Somers KD (1996) Amplification and expression of EMS-1 (cortactin) in head and neck squamous cell carcinoma cell lines. Oncogene 12: 31–35.
  44. 44. Ishii H, Baffa R, Numata SI, Murakumo Y, Rattan S, et al. (1999) The FEZ1 gene at chromosome 8p22 encodes a leucine-zipper protein, and its expression is altered in multiple human tumors. Proc Natl Acad Sci USA 96: 3928–3933.
  45. 45. Liu Q, Jones TI, Tang VW, Brieher WM, Jones PL (2010) Facioscapulohumeral muscular dystrophy region gene-1 (FRG-1) is an actin-bundling protein associated with muscle-attachment sites. J Cell Sci 123: 1116–1123.
  46. 46. Gerlitz G, Bustin M (2010) Efficient cell migration requires global chromatin condensation. J Cell Sci 123: 2207–2217.
  47. 47. Bustin M (1999) Regulation of DNA-Dependent Activities by the Functional Motifs of the High Mobility-Group Chromosomal Proteins. Mol Cell Biol 19(8): 5237–5246.
  48. 48. Linial M, Miller K, Scheller RH (1989) VAT-1: an abundant membrane protein from Torpedo cholinergic synaptic vesicles. Neuron 2: 1265–1273.
  49. 49. Han C, Chen T, Yang M, Li N, Liu H, et al. (2009) Human SCAMP5, a Novel Secretory Carrier Membrane Protein, Facilitates Calcium-Triggered Cytokine Secretion by Interaction with SNARE Machinery. J Immunol 182: 2986–2996.
  50. 50. Depaz IM, de Las Heras R, Kroon PA, Wilce PA (2005) Changes in neuronal protein 22 expression and cytoskeletal association in the alcohol-dependent and withdrawn rat brain. J Neurosci Res 81: 253–60.
  51. 51. Gu Y, Ihara Y (2000) Evidence that collapsin response mediator protein-2 is involved in the dynamics of microtubules. J Biol Chem 275: 17917–20.
  52. 52. Lan MS, Breslin MB (2009) Structure, expression, and biological function of INSM1 transcription factor in neuroendocrine differentiation. FASEB J 23(7): 2024–2033.
  53. 53. Liu WD, Wang HW, Muguira M, Breslin MB, Lan MS (2006) INSM1 functions as a transcriptional repressor of the neuroD/β2 gene through the recruitment of cyclin D1 and histone deacetylases. Biochem J. 397(Pt 1): 169–177.
  54. 54. Ren WZ, Ng GY, Wang RX, Wu PH, O'Dowd BF, et al. (1994) The identification of NP25: a novel protein that is differentially expressed by neuronal subpopulations. Brain Res Mol Brain Res 22: 173–185.
  55. 55. Yamashita N, Uchida Y, Ohshima T, Hirai S, Nakamura F, et al. (2006) Collapsin response mediator protein 1 mediates reelin signaling in cortical neuronal migration. J Neuroscience 26(51): 13357–13362.
  56. 56. Atz ME, Rollins B, Vawter MP (2007) NCAM1 association study of bipolar disorder and schizophrenia: polymorphisms and alternatively spliced isoforms lead to similarities and differences. Psychiatr Genet 17(2): 55–67.
  57. 57. Fernández-Chacón R, Südhof TC (2000) Novel SCAMPs Lacking NPF Repeats: Ubiquitous and Synaptic Vesicle-Specific Forms Implicate SCAMPs in Multiple Membrane-Trafficking Functions. J Neurosci 20(21): 7941–7950.
  58. 58. Watanabe H, Mizutani T, Haraguch T, Yamamichi N, Minoguchi S, et al. (2006) SWI/SNF complex is essential for NRSF-mediated suppression of neuronal genes in human nonsmall cell lung carcinoma cell lines. Oncogene 25: 470–479.
  59. 59. Tian T, Hao J, Xu A, Hao J, Luo C, et al. (2007) Determination of metastasis-associated proteins in non-small-cell lung cancer by comparative proteomic analysis. Cancer Sci 98: 1265–74.
  60. 60. Zhang L, Fogg DK, Waisman DM (2004) RNA interference-mediated silencing of the S100A10 gene attenuates plasmin generation and invasiveness of Colo 222 colorectal cancer cells. J Biol Chem 279: 2053–2062.
  61. 61. Martins L, Matsuo SE, Ebina KN, Kulcsar MA, Friguglietti CU, et al. (2002) Galectin-3 messenger ribonucleic acid and protein are expressed in benign thyroid tumors. J Clin Endocrinol Metab 87: 4806–10.
  62. 62. Yoshimura A, Gemma A, Hosoya Y, Komaki E, Hosomi Y, et al. (2003) Increased expression of the LGALS3 (galactin 3) gene is human non-small cell lung cancer. Genes Chromosomes Cancer 37: 159–64.
  63. 63. McHugh BJ, Buttery R, Lad Y, Banks S, Haslett C, et al. (2010) Integrin activation by Fam38A uses a novel mechanism of R-Ras targeting to the endoplasmic reticulum. J Cell Sci 123: 51–61.
  64. 64. Kanehisa M, Goto S, Furumichi M, Tanabe M, Hirakawa M (2010) KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res 38: D355–D360.
  65. 65. Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, et al. (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34: D354–357.
  66. 66. Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res 28: 27–30.
  67. 67. Huang DW, Sherman BT, Lempicki RA (2009) Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nature Protoc 4(1): 44–57.
  68. 68. Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37(1): 1–13.
  69. 69. van Alphen RJ, Wiemer EAC, Burger H, Eskens ALM (2009) The spliceosome as target for anticancer treatment. Br J Cancer 100(2): 228–232.
  70. 70. Nahar R, Müschen M (2009) Pre-B cell receptor signaling in acute lymphoblastic leukemia. Cell Cycle 8(23): 3874–3877.
  71. 71. Downing JR (1999) The AML1-ETO chimaeric transcription factor in acute myeloid leukaemia: biology and clinical significance. Br J Haematol 106(2): 296–308.
  72. 72. Sun L, Heerema N, Crotty L, Wu X, Navara C et al (1999) Expression of dominant-negative and mutant isoforms of the antileukemic transcription factor Ikaros in infant acute lymphoblastic leukemia. Proc Natl Acad Sci USA 96(2): 680–685.
  73. 73. Mengubas K, Jabbar S, Nye K, Wilkes S, Hoffbrand AV, et al. (1994) Inactivation of calcium ion-regulating inositol polyphosphate 2nd-messengers is impaired in subpopulations of human leukemia-cells. Leukemia 8(10): 1718–1725.
  74. 74. Fiancette R, Vincent c, Donnard M, Bordessoule d, Turlure P, et al. (2009) Genes encoding multiple forms of phospholipase A2 are expressed in immature forms of human leukemic blasts. Leukemia 23: 1196–1199.
  75. 75. Agatha G, Häfer R, Zintl F (2001) Fatty acid composition of lymphocyte membrane phospholipids in children with acute leukemia. Cancer Lett 173(2): 139–144.
  76. 76. Chiarini F, Falà F, Tazzari PL, Ricci F, Astolfi A, et al. (2009) Dual inhibition of class IA phosphatidylinositol 3-kinase and mammalian target of Rapamycin as a new therapeutic option for T-cell acute lymphoblastic leukemia. Cancer Res 69: 3520–3528.
  77. 77. Aifantis I, Raetz E, Buonamici S (2008) Molecular pathogenesis of T-cell leukaemia and lymphoma. Nat Rev Immunol 8: 380–390.
  78. 78. Zweidler-McKay PA, He Y, Xu L, Rodriguez CG, Karnell FG, et al. (2005) Notch signaling is a potent inducer of growth arrest and apoptosis in a wide range of B-cell malignancies. Blood 106(12): 3898–3906.
  79. 79. Tavor S, Eisenbach M, Jacob-Hirsch J, Golan T, Petit I, et al. (2008) The CXCR4 antagonist AMD3100 impairs survival of human AML cells and induces their differentiation. Leukemia 22: 2151–2158.
  80. 80. Bruserud Ø, Gjertsen BT, Ulvestad E (2002) Expression of Fcε-receptors by human acute myelogenous leukemia (AML) blasts: studies of high- and low- (CD23) affinity receptor expression and the effects of IgE-mediated receptor ligation on functional AML blast characteristics. Leuk Res 26(5): 515–521.
  81. 81. Cancelas JA, Lee AW, Prabhakar R, Stringer KF, Zheng Y, et al. (2005) Rac GTPases differentially integrate signals regulating hematopoietic stem cell localization. Nat Med 11: 886–891.
  82. 82. Gu Y, Filippi MD, Cancelas JA, Siefring JE, Williams EP, et al. (2003) Hematopoietic cell regulation by Rac1 and Rac2 guanosine triphosphatases. Science 302: 445–449.
  83. 83. Kuroda K, Takenoyama M, Baba T, Shigematsu Y, Shiota H, et al. (2009) Identification of ribosomal protein L19 as a novel tumor antigen recognized by autologous cytotoxic T lymphocytes in lung adenocarcinoma. Cancer Sci 101(1): 46–55.
  84. 84. Nadano D, Notsu T, Matsuda T, Sato TA (2002) A human gene encoding a protein homologous to ribosomal protein L39 is normally expressed in the testis and derepressed in multiple cancer cells. Biochim Biophys Acta 1577(3): 430–436.
  85. 85. Ocak S, Yamashita H, Udyavar AR, Miller AN, Gonzalez AL, et al. (2010) DNA copy number aberrations in small-cell lung cancer reveal activation of the focal adhesion pathway. Oncogene 29: 6331–6342.
  86. 86. Mukhopadhyay NK, Gordon GJ, Chen CJ, Bueno R, David J, Sugarbaker DJ, et al. (2007) Activation of focal adhesion kinase in human lung cancer cells involves multiple and potentially parallel signaling events. J Cell Mol Med 9(2): 387–397.
  87. 87. R Development Core Team (2010) R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, URL http://www.R-project.org/.