
Nearest shrunken centroids via alternative genewise shrinkages

  • Byeong Yeob Choi,

    Affiliations Department of Epidemiology and Biostatistics, University of Texas Health Science Center, San Antonio, TX, United States of America, Department of Biostatistics, University of North Carolina, Chapel Hill, NC, United States of America

  • Eric Bair,

    Affiliations Department of Biostatistics, University of North Carolina, Chapel Hill, NC, United States of America, Department of Endodontics, University of North Carolina, Chapel Hill, NC, United States of America

  • Jae Won Lee

    jael@korea.ac.kr

    Affiliation Department of Statistics, Korea University, Anam-Dong, Seoul, South Korea

Abstract

Nearest shrunken centroids (NSC) is a popular classification method for microarray data. NSC calculates centroids for each class and “shrinks” the centroids toward 0 using soft thresholding. Future observations are then assigned to the class with the minimum distance between the observation and the (shrunken) centroid. Under certain conditions the soft shrinkage used by NSC is equivalent to a LASSO penalty. However, this penalty can produce biased estimates when the true coefficients are large. In addition, NSC ignores the fact that multiple measures of the same gene are likely to be related to one another. We consider several alternative genewise shrinkage methods to address these shortcomings of NSC. Three alternative penalties were considered: the smoothly clipped absolute deviation (SCAD), the adaptive LASSO (ADA), and the minimax concave penalty (MCP). We also show that NSC can be performed in a genewise manner. Classification methods were derived for each alternative shrinkage method or alternative genewise penalty, and the performance of each new classification method was compared with that of conventional NSC on several simulated and real microarray data sets. Moreover, we applied the geometric mean approach to the alternative penalty functions. In general, the alternative (genewise) penalties required fewer genes than NSC. The geometric mean of the class-specific prediction accuracies was improved, as was the overall predictive accuracy in some cases. These results indicate that these alternative penalties should be considered when using NSC.

Introduction

Nearest shrunken centroids (NSC) is one of the most frequently used classification methods for high-dimensional data such as microarray data [1, 2]. NSC shrinks the average expression (i.e., centroid) of each gene within each class toward the overall centroid via soft thresholding. Genes whose expression levels do not significantly differ between the classes will have their centroids reduced to the overall centroids, effectively removing them from the classification procedure. The amount of shrinkage is determined by cross validation. Then class prediction is performed using the shrunken centroids, which allows one to identify important genes and predict the class of unlabeled observations.

Wang and Zhu [3] showed that NSC is the solution to a regression problem that estimates the class centroids subject to an L1 penalty (i.e., the LASSO of Tibshirani [4]). They observed that the LASSO penalty applies the same penalty to every centroid, whereas the centroids for the same gene should be treated as one group. To overcome this problem, they proposed two NSC methods using different penalties: adaptive L∞-norm penalized NSC (ALP-NSC) and adaptive hierarchically penalized NSC (AHP-NSC). They showed that the two methods outperform the original NSC in terms of misclassification error rate and the number of variables with nonzero centroids. However, ALP-NSC requires an exhaustive search to find an index set satisfying a certain condition; if no such indices exist, quadratic programming must be employed to estimate the parameters. AHP-NSC requires an iterative procedure to estimate the parameters, which increases the computational burden as the number of genes grows.

While Wang and Zhu [3] sought to improve NSC by accounting for the correlation between the centroids of the same gene, Guo et al. [5] improved NSC by regularizing the covariance matrix of the genes in addition to shrinking the class centroids. In fact, Guo et al. [5] modified the classical linear discriminant score rather than the diagonal linear discriminant score, so their method is a generalized version of NSC. Pang et al. [6] proposed an improved diagonal linear discriminant analysis (LDA) through shrinkage and regularization of the variances, but their method does not perform variable selection. Several authors have proposed new types of sparse LDA and provided the related optimality conditions and asymptotic properties: Shao et al. [7] applied the thresholding methodology developed for function estimation to the estimation of the means and variances, and Mai et al. [8] used the least squares formulation of LDA.

Another way to improve NSC is to modify how the optimal threshold is selected, as in Blagus and Lusa [9]. They improved NSC for class-imbalanced data by selecting the optimal threshold as the value that maximizes the geometric mean of the class-specific prediction accuracies. Their numerical studies showed that the modified NSC improved the prediction accuracy of the minority class and the area under the curve (AUC), and, for some real data sets, even the average prediction accuracy across all classes.

In this article, we propose methods that improve NSC through alternative shrinkage of the class centroids. Like Wang and Zhu [3], our methods use an additional tuning parameter that controls the amount of penalization applied to the parameters. These alternative shrinkages were derived from three existing penalized regression methods, namely the smoothly clipped absolute deviation (SCAD) [10], the adaptive LASSO (ADA) [11], and the minimax concave penalty (MCP) [12], which are known to outperform LASSO regression in some situations. They enjoy the oracle property, meaning that the efficiency of these estimators is not reduced when the subset of variables with nonzero coefficients is unknown. As noted earlier, under an orthonormal design (such as the NSC setting), the LASSO solution can be obtained via soft thresholding. Similarly, these three regression methods also have simple solutions in the NSC setting, so the computation is easy and fast. While the LASSO solution yields biased estimates for large coefficients, these methods produce (nearly) unbiased estimates. Several researchers have considered the use of alternative shrinkage methods in place of soft shrinkage [2, 13, 14]. In this article, we systematically evaluate the performance of these alternative shrinkage methods by comparing them with conventional soft shrinkage through simulation and real data studies.

Blockwise additive penalties, discussed in Antoniadis and Fan [15], are shown to give alternative genewise shrinkage estimators of the class centroids in the NSC setting, where the block is the gene. Similar to the methods of Wang and Zhu [3], these estimators use the fact that the centroids of the same gene should be treated as a group, but they are less computationally intensive than those of Wang and Zhu [3] because no iterative procedure is involved. The approach of Blagus and Lusa [9] was also applied to our alternative (genewise) penalties to further improve NSC, especially for class-imbalanced data.

In the Methods section, we describe the penalized least squares framework for general shrinkage methods using the model of Wang and Zhu [3], which includes NSC as a special case. We examine the performance of NSC with alternative penalty functions (ALT-NSC), namely the SCAD, the adaptive LASSO and the MCP. We also describe how ALT-NSC can be extended to genewise inference (GEN-NSC). In the Simulation section, we conduct simulation studies showing that ALT-NSC and GEN-NSC have substantially better performance than NSC in terms of predictive accuracy and feature selection in data sets with multiple classes. In the Real Data Study section, we apply the proposed variants of NSC to several real microarray data sets. A discussion and concluding remarks are provided in the last two sections.

Methods

Penalized least squares for the nearest shrunken centroids

Adapting the framework of Wang and Zhu [3], let $x_{ij}$ be the gene expression for the jth gene of the ith sample (j = 1, …, p; i = 1, …, n). There are K classes and each sample i belongs to one of the K classes, that is, $i \in C_k$, where $C_k$ is the set of sample indices belonging to class $k \in \{1, \dots, K\}$ and $n_k$ is the number of samples in class k. The average expressions for the jth gene in the kth class and over the entire data set are $\bar{x}_{kj} = \sum_{i \in C_k} x_{ij}/n_k$ and $\bar{x}_j = \sum_{i=1}^{n} x_{ij}/n$, respectively.

Let $d_{kj} = (\bar{x}_{kj} - \bar{x}_j)/(m_k s_j)$, where $s_j$ is the pooled within-class standard deviation for the jth gene, $s_j^2 = \frac{1}{n-K}\sum_{k=1}^{K}\sum_{i \in C_k}(x_{ij} - \bar{x}_{kj})^2$, and $m_k$ is the class-specific normalizing constant of Tibshirani et al. [1, 2], chosen so that $m_k s_j$ estimates the standard error of the numerator. Alternatively, $s_j + s_0$ can be used in place of $s_j$ to prevent genes with low expression levels from having large values by chance due to very small $s_j$, where $s_0$ is a small positive constant. The statistic $d_{kj}$ is equivalent to $d_{kj}$ in Tibshirani et al. [1, 2], which is a t-statistic for the jth gene comparing class k to the average of the other classes.
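To make these quantities concrete, the following R sketch (illustrative only; not taken from the authors' S1 Rscript) computes the class centroids, the overall centroids and the pooled within-class standard deviations; the function name gene_summaries is ours.

```r
## Minimal sketch: per-gene class centroids, overall centroids and pooled
## within-class standard deviations. `gene_summaries` is an illustrative name,
## not a function from the paper's supplementary code.
gene_summaries <- function(x, y) {
  ## x: n x p expression matrix; y: vector of class labels of length n
  classes <- sort(unique(y)); K <- length(classes); n <- nrow(x)
  class_means <- t(sapply(classes, function(k) colMeans(x[y == k, , drop = FALSE])))
  overall_mean <- colMeans(x)
  within_ss <- Reduce(`+`, lapply(classes, function(k) {
    xk <- x[y == k, , drop = FALSE]
    colSums(sweep(xk, 2, colMeans(xk))^2)   # within-class sum of squares
  }))
  s <- sqrt(within_ss / (n - K))            # pooled within-class SD, s_j
  list(class_means = class_means, overall_mean = overall_mean, s = s)
}
```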

Let $y_{ij} = (x_{ij} - \bar{x}_j)/(m_{k(i)} s_j)$ and consider the following linear model: (1) $y_{ij} = \sum_{k=1}^{K} z_{ik}\,\mu_{kj} + \varepsilon_{ij}$, where $z_{ik} = 1$ if sample i belongs to class k and 0 otherwise, $\mu_{kj}$ is a parameter to be estimated, and $\varepsilon_{ij}$ is an independent error term whose variance depends on the class to which sample i belongs. For a fixed gene index j, $\mu_{kj}$ represents a deviation from the overall mean, so the $\mu_{kj}$ are subject to a sum-to-zero constraint across the K classes. The class index to which sample i belongs is denoted by $k(i) \in \{1, \dots, K\}$. By multiplying both sides of Eq (1) by $1/\sqrt{n_{k(i)}}$, we have (2) $y^*_{ij} = \sum_{k=1}^{K} \frac{z_{ik}}{\sqrt{n_k}}\,\mu_{kj} + \varepsilon^*_{ij}$, where $y^*_{ij} = y_{ij}/\sqrt{n_{k(i)}}$ and $\varepsilon^*_{ij} = \varepsilon_{ij}/\sqrt{n_{k(i)}}$. In vector notation, Eq (2) can be written as $y^* = W\mu + \varepsilon^*$, where $y^* = (y^*_{11}, \dots, y^*_{np})^T$, $\mu = (\mu_{11}, \dots, \mu_{Kp})^T$, $\varepsilon^* = (\varepsilon^*_{11}, \dots, \varepsilon^*_{np})^T$, and $A^T$ denotes the transpose of a vector or matrix A. The design matrix $W = (W_1, \dots, W_{Kp})$ is an $np \times Kp$ matrix, where $W_l$ is an $np \times 1$ vector that corresponds to the lth element of the vector $\mu$ for $l = 1, \dots, Kp$. If an index l corresponds to class k and gene j, then $n_k$ elements of $W_l$ are equal to $1/\sqrt{n_k}$ and the rest are zero, because exactly $n_k$ samples belong to class k. This implies that $W_l^T W_l = 1$. In addition, each row of W has only one non-zero element, because each $y^*_{ij}$ involves only one $\mu_{kj}$; this implies $W_l^T W_h = 0$ for $l \neq h$. Thus, W is orthonormal. Note that $\hat{\mu}^0 = W^T y^*$ is the least squares estimator of $\mu$; let $\hat{\mu}^0_{kj}$ denote its element corresponding to class k and gene j. Since W is orthonormal, a form of the penalized least squares is given by [10]: (3) $\frac{1}{2}\|y^* - W\hat{\mu}^0\|^2 + \sum_{j=1}^{p}\sum_{k=1}^{K}\{\frac{1}{2}(\hat{\mu}^0_{kj} - \mu_{kj})^2 + p_\lambda(|\mu_{kj}|)\}$, where $p_\lambda(\cdot) = \lambda p(\cdot)$ is a penalty function and $\|A\|^2 = \sum_{i=1}^{n} a_i^2$ when $A = (a_1, \dots, a_n)^T$. The problem of minimizing Eq (3) with respect to the $\mu_{kj}$ is equivalent to minimizing it componentwise. By ignoring $\frac{1}{2}\|y^* - W\hat{\mu}^0\|^2$, which does not involve the parameters, we can consider the following penalized least squares problem: (4) $\frac{1}{2}(\hat{\mu}^0_{kj} - \mu_{kj})^2 + p_\lambda(|\mu_{kj}|)$.

Eq (4) shows that the minimization problem in Eq (3) has been converted to a univariate minimization problem. Since the univariate solutions for the regression coefficients are presented in the papers describing these penalized regression methods, we can use those solutions to obtain $\hat{\mu}_{kj}$. NSC uses the LASSO penalty function $p_\lambda(|\mu_{kj}|) = \lambda|\mu_{kj}|$ [4], and the resulting estimator for $\mu_{kj}$ is given by $\hat{\mu}_{kj} = \mathrm{sgn}(\hat{\mu}^0_{kj})\,(|\hat{\mu}^0_{kj}| - \lambda)_+$, where sgn is the sign function and $z_+$ denotes the positive part of z. The LASSO solution is equivalent to the soft shrinkage estimate [16]. The resulting estimators for $\mu_{kj}$ under alternative penalties are presented in the next subsection.
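As a minimal illustration (not the authors' implementation), the soft-thresholding rule can be written as a short R function applied elementwise to the least squares estimates:

```r
## Soft thresholding (the LASSO/NSC solution of Eq (4)), applied elementwise
## to a K x p matrix of least squares estimates mu0.
soft_threshold <- function(mu0, lambda) {
  sign(mu0) * pmax(abs(mu0) - lambda, 0)
}
```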

To predict the class of a new sample $x^* = (x^*_1, \dots, x^*_p)^T$, we define the discriminant score for class k as $\delta_k(x^*) = \sum_{j=1}^{p} (x^*_j - \tilde{x}_{kj})^2/s_j^2 - 2\log\pi_k$, where $\tilde{x}_{kj}$ is the shrunken mean for the jth gene in class k obtained from the shrunken estimate $\hat{\mu}_{kj}$ and $\pi_k = n_k/n$ is a prior probability estimate for class k. The shrunken mean and the discriminant score depend on the shrinkage method used; hence the choice of shrinkage method affects both class prediction and gene selection. Finally, the classification rule is given by $\hat{k}(x^*) = \arg\min_{k} \delta_k(x^*)$.
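The classification rule can be sketched in R as follows. Here `shrunken_centroid` (a p x K matrix of class means reconstructed from the shrunken estimates), `s` (the pooled within-class standard deviations) and `prior` (the class proportions n_k/n) are assumed inputs; how the shrunken estimates are mapped back to the centroid scale depends on the standardization used above.

```r
## Minimal sketch of the discriminant score and classification rule.
predict_nsc <- function(x_new, shrunken_centroid, s, prior) {
  K <- ncol(shrunken_centroid)
  score <- sapply(seq_len(K), function(k) {
    sum((x_new - shrunken_centroid[, k])^2 / s^2) - 2 * log(prior[k])
  })
  which.min(score)  # index of the class with the smallest discriminant score
}
```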

Alternative shrinkage methods (ALT-NSC)

Here we describe several shrinkage methods that are possible alternatives to soft shrinkage. The first-order derivative of the SCAD penalty function [10] is defined as $p'_\lambda(\theta) = \lambda\{I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda} I(\theta > \lambda)\}$ for $\theta > 0$ and some $a > 2$. This penalty function gives smaller penalties to larger coefficients. The resulting estimator for $\mu_{kj}$ is $\hat{\mu}_{kj} = \mathrm{sgn}(\hat{\mu}^0_{kj})(|\hat{\mu}^0_{kj}| - \lambda)_+$ if $|\hat{\mu}^0_{kj}| \le 2\lambda$; $\hat{\mu}_{kj} = \{(a-1)\hat{\mu}^0_{kj} - \mathrm{sgn}(\hat{\mu}^0_{kj})\,a\lambda\}/(a-2)$ if $2\lambda < |\hat{\mu}^0_{kj}| \le a\lambda$; and $\hat{\mu}_{kj} = \hat{\mu}^0_{kj}$ if $|\hat{\mu}^0_{kj}| > a\lambda$.

If a is close to 2, then SCAD behaves like a hard shrinkage estimate when estimating μkj.
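A sketch of the SCAD thresholding rule in R (the default a = 3.7 follows the suggestion of Fan and Li [10]):

```r
## SCAD thresholding applied elementwise to the least squares estimates (a > 2).
scad_threshold <- function(mu0, lambda, a = 3.7) {
  out <- mu0                                   # |mu0| > a*lambda: left unpenalized
  idx1 <- abs(mu0) <= 2 * lambda
  idx2 <- abs(mu0) > 2 * lambda & abs(mu0) <= a * lambda
  out[idx1] <- sign(mu0[idx1]) * pmax(abs(mu0[idx1]) - lambda, 0)
  out[idx2] <- ((a - 1) * mu0[idx2] - sign(mu0[idx2]) * a * lambda) / (a - 2)
  out
}
```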

The adaptive LASSO penalty function [11], which is the LASSO penalty function with a data-dependent weight, is given by $p_\lambda(|\mu_{kj}|) = \lambda|\mu_{kj}|/|\hat{\mu}^0_{kj}|^a$, where $a > 0$. The resulting solution is $\hat{\mu}_{kj} = \mathrm{sgn}(\hat{\mu}^0_{kj})\,(|\hat{\mu}^0_{kj}| - \lambda/|\hat{\mu}^0_{kj}|^a)_+$, or equivalently $\hat{\mu}_{kj} = \hat{\mu}^0_{kj}\,(1 - \lambda/|\hat{\mu}^0_{kj}|^{a+1})_+$.

The adaptive LASSO solution is equivalent to soft shrinkage when a = 0 and is similar to the nonnegative garrote when a = 1 [17] (although the nonnegative garrote requires additional sign restrictions).
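The corresponding adaptive-LASSO thresholding rule, as a sketch:

```r
## Adaptive LASSO thresholding: the threshold for each estimate is scaled by
## |mu0|^(-a); a = 0 recovers ordinary soft thresholding.
ada_threshold <- function(mu0, lambda, a = 1) {
  sign(mu0) * pmax(abs(mu0) - lambda / abs(mu0)^a, 0)
}
```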

The MCP penalty function [12] is defined as $p_\lambda(\theta) = \lambda(\theta - \theta^2/(2a\lambda))$ for $0 \le \theta \le a\lambda$ and $p_\lambda(\theta) = a\lambda^2/2$ for $\theta > a\lambda$, where $a > 1$. The resulting solution is given by $\hat{\mu}_{kj} = \frac{a}{a-1}\,\mathrm{sgn}(\hat{\mu}^0_{kj})(|\hat{\mu}^0_{kj}| - \lambda)_+$ if $|\hat{\mu}^0_{kj}| \le a\lambda$, and $\hat{\mu}_{kj} = \hat{\mu}^0_{kj}$ if $|\hat{\mu}^0_{kj}| > a\lambda$.

The MCP solution is equivalent to firm shrinkage, which offers advantages over soft and hard shrinkage [18]. The MCP solution approaches hard shrinkage as a → 1 and soft shrinkage as a → ∞.
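A sketch of the MCP (firm) thresholding rule:

```r
## MCP / firm thresholding (a > 1); estimates with |mu0| > a*lambda are left
## unshrunken, so the rule interpolates between soft (a -> Inf) and hard
## (a -> 1) thresholding.
mcp_threshold <- function(mu0, lambda, a = 3) {
  out <- mu0
  idx <- abs(mu0) <= a * lambda
  out[idx] <- (a / (a - 1)) * sign(mu0[idx]) * pmax(abs(mu0[idx]) - lambda, 0)
  out
}
```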

As mentioned previously, these shrinkage methods are known to have oracle properties under some mild conditions (for details, see [10], [11] and [12]). The LASSO produces estimates that are biased toward zero, and this bias can also cause its variable selection to be inconsistent [12]. The basic reason the alternative shrinkages can produce better estimates is that they apply different rules for estimating the coefficients $\mu_{kj}$ depending on the size of $|\mu_{kj}|$. When the coefficients are large, these procedures leave them almost (or completely) unpenalized. Thus, they overcome the tendency of soft shrinkage to produce biased estimates.

While soft shrinkage has one tuning parameter λ, the alternative shrinkage methods have two tuning parameters, a and λ. The tuning parameter a controls the size of the penalties for large coefficients. The tuning parameters are determined by cross validation (CV). In our subsequent analysis, six values of the tuning parameter a were examined for each ALT-NSC and genewise shrinkage method: (0.5, 1, 1.5, 2, 2.5, 3) for the adaptive LASSO penalty, (2.01, 2.2, 2.5, 2.8, 3.2, 3.7) for the SCAD penalty and (1.01, 1.3, 1.7, 2, 2.5, 3) for the MCP penalty. For each method, thirty values of λ were considered. When there were ties among the CV prediction accuracies or g-means, we chose the parameters resulting in the smaller number of genes, as sketched below.
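The two-dimensional search over (a, λ) could be organized as in the following sketch; `cv_criterion` (returning the m-fold CV prediction accuracy or g-mean for a tuning pair) and `n_selected` (returning the number of selected genes) are hypothetical placeholders, not functions from the paper's code.

```r
## Grid search over (a, lambda) with ties broken in favor of fewer genes.
select_tuning <- function(a_grid, lambda_grid, cv_criterion, n_selected) {
  grid <- expand.grid(a = a_grid, lambda = lambda_grid)
  grid$crit <- mapply(cv_criterion, grid$a, grid$lambda)
  grid$size <- mapply(n_selected, grid$a, grid$lambda)
  best <- grid[grid$crit == max(grid$crit), ]   # all tuning pairs tied at the max
  best[which.min(best$size), c("a", "lambda")]  # smallest gene set among the ties
}
```

For example, `select_tuning(c(1.01, 1.3, 1.7, 2, 2.5, 3), lambda_grid, cv_criterion, n_selected)` would run the search over the MCP grid listed above.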

Genewise shrinkage methods (GEN-NSC)

Here we extend the shrinkage methods discussed in the previous subsection to genewise inference. Let $\mu_j = (\mu_{1j}, \dots, \mu_{Kj})^T$ denote the K × 1 mean vector for the jth gene, and let $\hat{\mu}^0_j = (\hat{\mu}^0_{1j}, \dots, \hat{\mu}^0_{Kj})^T$ denote the corresponding vector of least squares estimates. The objective function to be minimized for the genewise penalized least squares estimator is (5) $\frac{1}{2}\|y^* - W\mu\|^2 + \sum_{j=1}^{p} p_\lambda(\|\mu_j\|)$, where $\|\mu_j\| = (\sum_{k=1}^{K}\mu_{kj}^2)^{1/2}$.

Note that instead of penalizing each $\mu_{kj}$, we penalize the vector $\mu_j$. Using the fact that $\|y^* - W\mu\|^2 = \|y^* - W\hat{\mu}^0\|^2 + \|\hat{\mu}^0 - \mu\|^2$ and the orthonormality of W, Eq (5) can be written as (6) $\frac{1}{2}\|y^* - W\hat{\mu}^0\|^2 + \sum_{j=1}^{p}\{\frac{1}{2}\|\hat{\mu}^0_j - \mu_j\|^2 + p_\lambda(\|\mu_j\|)\}$.

The solution to Eq (6) is genewise separable, and thus one may solve it by minimizing, for each gene j, (7) $\frac{1}{2}\|\hat{\mu}^0_j - \mu_j\|^2 + p_\lambda(\|\mu_j\|)$.

Using the result of Antoniadis and Fan [15], the solution to Eq (7) is given by (8) $\hat{\mu}_j = (\hat{\theta}_j/\|\hat{\mu}^0_j\|)\,\hat{\mu}^0_j$, where $\hat{\theta}_j$ is the solution to (9) $\min_{\theta \ge 0}\,\{\frac{1}{2}(\|\hat{\mu}^0_j\| - \theta)^2 + p_\lambda(\theta)\}$.

Since Eq (8) depends on the penalty function pλ(⋅), we can derive genewise shrinkage methods under diverse penalty functions. Note that the problem of solving Eq (9) is equivalent to that of Eq (4), and thus, the computational complexity of the genewise shrinkages is the same as that of the alternative shrinkages.

When the LASSO penalty is employed, $\hat{\mu}_j = \frac{(\|\hat{\mu}^0_j\| - \lambda)_+}{\|\hat{\mu}^0_j\|}\,\hat{\mu}^0_j$.

If the SCAD penalty is used, $\hat{\mu}_j = \frac{(\|\hat{\mu}^0_j\| - \lambda)_+}{\|\hat{\mu}^0_j\|}\,\hat{\mu}^0_j$ if $\|\hat{\mu}^0_j\| \le 2\lambda$, $\hat{\mu}_j = \frac{(a-1)\|\hat{\mu}^0_j\| - a\lambda}{(a-2)\,\|\hat{\mu}^0_j\|}\,\hat{\mu}^0_j$ if $2\lambda < \|\hat{\mu}^0_j\| \le a\lambda$, and $\hat{\mu}_j = \hat{\mu}^0_j$ if $\|\hat{\mu}^0_j\| > a\lambda$, where $a > 2$. For the adaptive LASSO penalty, the resulting solution is $\hat{\mu}_j = \frac{(\|\hat{\mu}^0_j\| - \lambda/\|\hat{\mu}^0_j\|^a)_+}{\|\hat{\mu}^0_j\|}\,\hat{\mu}^0_j$.

For the MCP penalty, $\hat{\mu}_j = \frac{a}{a-1}\cdot\frac{(\|\hat{\mu}^0_j\| - \lambda)_+}{\|\hat{\mu}^0_j\|}\,\hat{\mu}^0_j$ if $\|\hat{\mu}^0_j\| \le a\lambda$ and $\hat{\mu}_j = \hat{\mu}^0_j$ if $\|\hat{\mu}^0_j\| > a\lambda$, where $a > 1$.

The thresholding rules of the genewise shrinkage methods are determined by the norm $\|\hat{\mu}^0_j\|$ instead of the individual $\hat{\mu}^0_{kj}$. By pooling information across the mean estimators belonging to the same gene, genewise shrinkage may improve the accuracy of the thresholded mean estimators. Furthermore, Eq (5) has a nice Bayesian interpretation [15]: the genewise penalized least squares method models the mean coefficients belonging to the same gene using proper prior distributions.
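A sketch of the genewise rule of Eqs (8) and (9): the chosen univariate thresholding function is applied to the norm of each gene's centroid vector, and the vector is rescaled accordingly (for example, passing `mcp_threshold` gives the genewise MCP, i.e., GMCP).

```r
## Genewise (blockwise) shrinkage: threshold ||mu0_j|| and rescale the whole
## K-vector for gene j. `threshold_fn` is any univariate rule defined above
## (soft_threshold, scad_threshold, ada_threshold, mcp_threshold).
genewise_threshold <- function(mu0, lambda, threshold_fn, ...) {
  ## mu0: K x p matrix; column j holds the K centroid estimates for gene j
  apply(mu0, 2, function(mu_j) {
    nrm <- sqrt(sum(mu_j^2))
    if (nrm == 0) return(mu_j)
    mu_j * (threshold_fn(nrm, lambda, ...) / nrm)
  })
}
```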

Geometric mean methods (GM)

Adapting the idea of Blagus and Lusa [9], we also considered choosing the tuning parameters (a, λ) of the (genewise) alternative shrinkage methods as the values that maximize the geometric mean of the class-specific prediction accuracies.
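As an illustration, the g-mean criterion can be computed from cross-validated class labels as in the following sketch:

```r
## Geometric mean of the class-specific prediction accuracies.
g_mean <- function(true_class, predicted_class) {
  classes <- unique(true_class)
  acc <- sapply(classes, function(k) mean(predicted_class[true_class == k] == k))
  prod(acc)^(1 / length(acc))
}
```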

Throughout the remainder of this manuscript, we refer to the genewise version of each method by adding “G” to the beginning of its abbreviated name. Moreover, when the tuning parameters are determined by the geometric mean, we add “GM-” to the beginning of the name. For example, “GADA” refers to the genewise version of the adaptive LASSO, and “GM-GADA” refers to GADA with tuning parameters determined by the geometric mean.

Simulations

In this section, we conducted simulation studies to compare ALT-NSC, GEN-NSC, and the GM versions of ALT-NSC and GEN-NSC to conventional NSC. We examined the overall prediction accuracy (PA), geometric mean (g-mean), area under the curve (AUC, only for the two-class scenario), sensitivity (SEN) and positive predictive value (PPV). SEN is the number of detected important genes divided by the total number of important genes, and PPV is the number of detected important genes divided by the total number of genes the method selects. As in Dudoit et al. [19], we present the median and upper quartile of the evaluation measures.
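The two gene-selection measures can be computed from the index sets of selected and truly differentially expressed genes, as in this sketch:

```r
## SEN: detected important genes / all important genes;
## PPV: detected important genes / all selected genes.
selection_measures <- function(selected, truly_de) {
  hits <- length(intersect(selected, truly_de))
  c(SEN = hits / length(truly_de),
    PPV = if (length(selected) == 0) NA else hits / length(selected))
}
```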

In the two-class classification scenario, we generated two classes from multivariate normal distributions MVN(μ1, Σ) and MVN(μ2, Σ), each of dimension p = 2500, with sample sizes $n_1 = n\pi_1$ and $n - n_1$. The vector μ1 was equal to 0 for all genes, and μ2 was 0.5 for 100 genes and 0 for the remaining genes. The 100 differentially expressed (DE) genes were randomly selected. As in Guo et al. [5] and Pang et al. [6], Σ was a block diagonal matrix with each diagonal block Σρ having an auto-regressive structure and alternating in sign. The block size was 50 × 50 and there were 50 blocks, giving a total of 2500 genes. ρ took values of 0.5 and 0.9, indicating sparse and dense correlation blocks, and π1 took values of 0.5 and 0.8, corresponding to class balance and class imbalance.
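The two-class design can be sketched in R as below. We read “alternating in sign” as AR(1) blocks whose correlation parameter alternates between ρ and −ρ across blocks (following the design of Guo et al. [5]); that reading, and the helper names, are our assumptions rather than the authors' code.

```r
## Minimal sketch of the two-class simulation design (assumed reading of the
## block structure; not the authors' code).
library(MASS)  # mvrnorm

make_sigma <- function(p = 2500, block_size = 50, rho = 0.5) {
  Sigma <- matrix(0, p, p)
  lag <- abs(outer(seq_len(block_size), seq_len(block_size), "-"))
  for (b in seq_len(p / block_size)) {
    idx <- ((b - 1) * block_size + 1):(b * block_size)
    r <- if (b %% 2 == 1) rho else -rho        # alternate the sign across blocks
    Sigma[idx, idx] <- r^lag                   # AR(1) block: r^|s - t|
  }
  Sigma
}

simulate_two_class <- function(n = 100, p = 2500, pi1 = 0.5, rho = 0.5,
                               n_de = 100, effect = 0.5) {
  Sigma <- make_sigma(p, rho = rho)
  de_genes <- sample(p, n_de)                  # randomly chosen DE genes
  mu2 <- rep(0, p); mu2[de_genes] <- effect
  n1 <- round(n * pi1); n2 <- n - n1
  x <- rbind(mvrnorm(n1, rep(0, p), Sigma), mvrnorm(n2, mu2, Sigma))
  list(x = x, y = rep(1:2, c(n1, n2)), de_genes = de_genes)
}
```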

The three-class classification scenario is very similar to the previous one. We generated three classes from multivariate normal distributions with fixed proportions (π1, π2, π3) = (0.4, 0.2, 0.4): MVN(μ1, Σ), MVN(μ2, Σ) and MVN(μ3, Σ), each with the same dimension as in the previous scenario. Ninety differentially expressed genes were randomly selected, and those DE genes had class means of (γ, 0, −γ). We used the same Σ as in the first simulation. We let γ take the values 0.5 and 0.1 to study how the effect size of the DE genes is related to the performance of the classifiers.

Given a, the tuning parameter λ was chosen to minimize the m-fold CV misclassification error rate on the training data set, with m = 5. We generated training data sets with sample size 100 and test data sets with 10 times the sample size of the training data. Test error rates were then computed using the tuning parameters selected by CV. Gene selection was performed in the same way as in Tibshirani et al. [1, 2], where genes with at least one nonzero difference were selected (the jth gene is selected if there exists at least one k such that $\hat{\mu}_{kj} \neq 0$).
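The gene-selection rule just described amounts to keeping the genes whose shrunken centroid vector is not identically zero:

```r
## A gene is selected if at least one of its K shrunken estimates is nonzero.
selected_genes <- function(mu_shrunken) {
  ## mu_shrunken: K x p matrix of shrunken estimates
  which(colSums(mu_shrunken != 0) > 0)
}
```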

Simulation results for the two-class scenario are presented in Tables 1, 2, 3 and 4. All of the proposed methods performed very similarly to NSC in terms of PA, g-mean and AUC except when the diagonal block matrix was dense and the classes were imbalanced; in that setting ALT-NSC improved the g-mean slightly, the GM versions of ALT-NSC and GEN-NSC also improved the g-mean, and GM-GSCAD had the highest g-mean. The classifiers had poorer prediction performance in terms of PA, g-mean and AUC when the block diagonal matrix was dense and the classes were imbalanced. Gene selection accuracy (SEN and PPV) also decreased when the classes were imbalanced.

Table 1. Two groups with sparse block diagonal structure (ρ = 0.5) and class-balance (π1 = 0.5).

https://doi.org/10.1371/journal.pone.0171068.t001

Table 2. Two groups with dense block diagonal structure (ρ = 0.9) and class-balance (π1 = 0.5).

https://doi.org/10.1371/journal.pone.0171068.t002

Table 3. Two groups with sparse block diagonal structure (ρ = 0.5) and class-imbalance (π1 = 0.8).

https://doi.org/10.1371/journal.pone.0171068.t003

Table 4. Two groups with dense block diagonal structure (ρ = 0.9) and class-imbalance (π1 = 0.8).

https://doi.org/10.1371/journal.pone.0171068.t004

Simulation results for the three-class scenario are presented in Tables 5, 6, 7 and 8. Unlike the two-class scenario, class sizes were always imbalanced. However, we considered different values of γ, the effect size of the DE genes, and observed that our proposed methods performed significantly better than NSC. When the effect size of the DE genes was moderate (γ = 0.5), only ALT-NSC had better PA and g-mean, while GEN-NSC and the GM methods showed no improvement. Under the very small effect size, all the classifiers performed very similarly in terms of PA, but their performance varied with respect to g-mean. First, MCP had the highest g-mean, and the other penalty functions gave zero as the median g-mean. Second, GM significantly improved the g-mean for all the penalty functions, and the improvement was greater when the genewise penalties were used with the sparse block matrix. Finally, gene selection was also improved by GM: SEN increased and PPV stayed at almost the same value, compared to the corresponding methods tuned by the cross-validation prediction accuracy criterion.

Table 5. Three groups with class-imbalance (π1 = 0.4, π2 = 0.2, π3 = 0.4), sparse block diagonal structure (ρ = 0.5) and moderate mean difference (γ = 0.5).

https://doi.org/10.1371/journal.pone.0171068.t005

Table 6. Three groups with class-imbalance (π1 = 0.4, π2 = 0.2, π3 = 0.4), dense block diagonal structure (ρ = 0.9) and moderate mean difference (γ = 0.5).

https://doi.org/10.1371/journal.pone.0171068.t006

Table 7. Three groups with class-imbalance (π1 = 0.4, π2 = 0.2, π3 = 0.4), sparse block diagonal structure (ρ = 0.5) and small mean difference (γ = 0.1).

https://doi.org/10.1371/journal.pone.0171068.t007

Table 8. Three groups with class-imbalance (π1 = 0.4, π2 = 0.2, π3 = 0.4), dense block diagonal structure (ρ = 0.9) and small mean difference (γ = 0.1).

https://doi.org/10.1371/journal.pone.0171068.t008

Real data study

In this section, we applied conventional NSC and the proposed methods (ALT-NSC and GEN-NSC) to four real microarray data sets. The main characteristics of the four microarray data sets are presented in Table 9.

Table 9. Characteristics of the real microarray data sets.

https://doi.org/10.1371/journal.pone.0171068.t009

The Gravier et al. [20] data set came from a breast cancer study that consists of 111 patients with no events and 57 patients with early metastasis after diagnosis. The Pomeroy et al. [21] data set is a CNS cancer study that consists of 10 medulloblastomas, 10 CNS AT/RTs (renal and extrarenal rhabdoid tumors), 8 supratentorial PNETs and 10 non-embryonal brain tumors (malignant glioma). The Yeoh et al. [22] data set is an acute lymphoblastic leukemia (ALL) study that consists of six pediatric ALL subtypes: 43 T-cell lineage ALL (T-ALL), 27 E2A-PBX1, 79 TEL-AML1, 20 MLL rearrangements, 15 BCR-ABL, and 64 hyperdiploid karyotypes with more than 50 chromosomes (HK50). The Ramaswamy et al. [23] data set consists of 14 types of cancer samples: 12 breast adenocarcinoma, 14 prostate adenocarcinoma, 12 lung adenocarcinoma, 12 colorectal adenocarcinoma, 22 lymphoma, 11 bladder transitional cell carcinoma, 10 melanoma, 10 uterine adenocarcinoma, 30 leukemia, 11 renal cell carcinoma, 11 pancreatic adenocarcinoma, 12 ovarian adenocarcinoma, 11 pleural mesothelioma and 20 central nervous system samples.

We randomly split each data set into a training set and a test set, with 33% of the data allocated to the test set. This process was repeated 100 times. We chose the optimal tuning parameters (a, λ) as the values that maximize the 5-fold CV prediction accuracy or g-mean on the training set. We compared prediction accuracy, g-mean, AUC (only for the Gravier data set) and the number of selected genes.
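The repeated-split protocol can be organized as in the sketch below; `fit_and_eval` is a hypothetical placeholder that tunes a classifier by 5-fold CV on the training portion and returns the test-set measures.

```r
## 100 random 67%/33% train/test splits; returns one column of measures per split.
evaluate_splits <- function(x, y, fit_and_eval, n_splits = 100, test_frac = 1/3) {
  replicate(n_splits, {
    test_idx <- sample(nrow(x), size = round(test_frac * nrow(x)))
    fit_and_eval(x[-test_idx, , drop = FALSE], y[-test_idx],
                 x[test_idx, , drop = FALSE], y[test_idx])
  })
}
```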

The Gravier data set is slightly imbalanced; the proportions of “no events” and “early metastasis” are 0.66 and 0.34. The proposed methods did not improve PA, g-mean or AUC, but using the alternative penalty functions reduced the number of selected genes (N-sig), with MCP having the smallest N-sig (Table 10). The Pomeroy data set is balanced. ALT-NSC improved g-mean, but not PA, and reduced N-sig, with the exception of SCAD. GM did not improve either PA or g-mean. MCP performed the best, with higher PA and g-mean and smaller N-sig compared to NSC. GMCP performed very similarly to MCP, with slightly inferior prediction performance but much smaller N-sig (Table 11). The Yeoh data set is imbalanced. As with the Golub data set [24], prediction was easy for this data set despite the large number of classes. PA and g-mean were not improved, but N-sig was reduced by the proposed methods. Both ALT-NSC and GEN-NSC reduced N-sig, with GMCP having the smallest N-sig: at the median, GMCP selected 418 genes while NSC selected 1456 genes (Table 12). The Ramaswamy data set has a large number of classes, and, as a result, all the classifiers had low g-mean values. All the MCP methods and GM-SCAD had positive g-mean values, but the other methods had zero as the 90% quantile of g-mean (Table 13).

Table 10. Gravier (2010) data set: Breast cancer study with 2 classes.

https://doi.org/10.1371/journal.pone.0171068.t010

Table 11. Pomeroy (2002) data set: CNS study with 4 classes.

https://doi.org/10.1371/journal.pone.0171068.t011

Table 12. Yeoh (2002) data set: Leukemia study with 6 classes.

https://doi.org/10.1371/journal.pone.0171068.t012

Table 13. Ramaswamy (2001) data set: Cancer study with 14 classes.

https://doi.org/10.1371/journal.pone.0171068.t013

Discussion

In this article, we proposed several variations of NSC that use alternative genewise shrinkages. We derived these methods from three penalized regression models that enjoy oracle properties and have closed-form solutions under an orthonormal design. We further modified these variants of NSC by adopting genewise penalty functions, which use the correlations between the parameters belonging to the same gene, and the geometric mean approach for class-imbalanced data. We showed through simulation and real data studies that these methods have better performance than conventional NSC in terms of prediction accuracy, g-mean and gene selection.

We conducted simulation studies to evaluate the proposed methods. We used a block diagonal covariance matrix whose blocks had an auto-regressive structure with a parameter ρ. When ρ is small, the block matrix is nearly sparse and behaves like an identity matrix; when ρ is large, the block matrix is dense and behaves like a block exchangeable matrix. We varied ρ, the degree of class imbalance and the effect size of the DE genes. The proposed methods had better performance than NSC in terms of prediction accuracy and gene selection when the block matrix was dense and the classes were imbalanced. When the effect size was moderate, the ALT-NSC methods performed well, and among those MCP performed the best. When the effect size was small, the GM methods performed well, achieving the highest g-mean.

We applied the proposed methods to four real microarray data sets. The proposed methods improved the g-mean, but not the overall prediction accuracy, in the data sets we considered. When the number of classes was two (Gravier data set) or prediction was easy (Yeoh data set), only gene selection was improved by the alternative penalty functions. In the data set with a moderate number of classes (Pomeroy data set), the g-mean was improved by the alternative penalty functions. When the data set had a very large number of classes (Ramaswamy data set), using the genewise penalty functions reduced performance.

In many applications, it is desirable to develop classifiers that use the smallest possible number of genes. For example, one may wish to use an RT-PCR assay to discriminate between different types of tumors or to determine the prognosis of a patient with a given tumor type. Such an assay will be prohibitively expensive if the expression levels of more than a handful of genes are needed. Thus, a classification method that produces accuracy comparable to another method while using fewer genes would be considered superior in these situations. Hence, the fact that our proposed methods consistently use fewer genes than conventional NSC represents a significant advantage, even if prediction accuracy is not always improved. MCP would be particularly useful in real applications because it was shown to select the most parsimonious gene set while maintaining competitive predictive accuracy.

Both the simulation and real data studies showed that our proposed methods produced greater improvement over conventional NSC in data sets with three or four classes, but not in data sets with very large numbers of classes. When the number of classes is large, the sample size per class is usually small, and this affects the efficiency of the shrunken mean estimators. By virtue of the oracle property, ALT-NSC can produce more efficient estimates of the shrunken means, which yields better performance in both prediction and gene selection. Genewise shrinkage also improves the NSC classifier by combining the related centroid estimates belonging to the same gene, producing more accurate estimates when the class size is small (which commonly occurs when the number of classes is large). Clearly, the genewise penalty (GEN-NSC) shrinks a mean estimator toward zero faster than the non-genewise penalty (ALT-NSC), as shown in the simulations and the real data study, and appropriately fast shrinkage can remove noisy genes effectively. However, when the number of classes is large, as in the Ramaswamy data set, the amount of shrinkage produced by the genewise penalty is so large that some prediction accuracy is lost. Thus, applying GEN-NSC to data sets with many classes may not be recommended when the objective is to maximize predictive accuracy (rather than minimize the number of selected genes).

The performance of the methods can be affected by heterogeneity of gene expression, which occurs when the variances of genes differ across classes. This was pointed out by Tibshirani et al. [2] and was observed in the comparative study of Lee et al. [25]. Tibshirani et al. [2] suggested an ad hoc method to account for heterogeneity; however, that method is only applicable when the class centroids are not separated. The method of Pang et al. [6] may overcome this problem because it combines the linear and quadratic discriminant scores, where the latter allows unequal variances across classes. Since that method does not perform variable selection, applying the mean shrinkage to it would be a direction for future research to handle this heterogeneity problem.

Supporting information

S1 Rscript. R source code.

This file contains the R functions that implement ALT-NSC and GEN-NSC.

https://doi.org/10.1371/journal.pone.0171068.s001

(ZIP)

Acknowledgments

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2016943438). Eric Bair was partially supported by NIH/NIDCR grant R03DE02359, NIH/NCATS grant UL1RR02574, and NIH/NIEHS grant P03ES01012.

Author Contributions

  1. Conceptualization: BYC EB JWL.
  2. Data curation: BYC.
  3. Formal analysis: BYC.
  4. Funding acquisition: JWL.
  5. Investigation: BYC EB JWL.
  6. Methodology: BYC EB JWL.
  7. Project administration: JWL.
  8. Resources: BYC.
  9. Software: BYC.
  10. Supervision: EB JWL.
  11. Validation: EB JWL.
  12. Visualization: BYC.
  13. Writing – original draft: BYC.
  14. Writing – review & editing: BYC EB JWL.

References

  1. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS. 2002;99(10):6567–6572. pmid:12011421
  2. Tibshirani R, Hastie T, Narasimhan B, Chu G. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Stat Sci. 2003;18(1):104–117.
  3. Wang S, Zhu J. Improved centroids estimation for the nearest shrunken centroid classifier. Bioinformatics. 2007;23(8):972–979. pmid:17384429
  4. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B Stat Methodol. 1996;58(1):267–288.
  5. Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics. 2007;8(1):86–100. pmid:16603682
  6. Pang H, Tong T, Zhao H. Shrinkage-based diagonal discriminant analysis and its applications in high-dimensional data. Biometrics. 2007;65:1021–1029.
  7. Shao J, Wang Y, Deng X, Wang S. Sparse linear discriminant analysis by thresholding for high dimensional data. Ann Stat. 2011;39(2):1241–1265.
  8. Mai Q, Zou H, Yuan M. A direct approach to sparse discriminant analysis in ultra-high dimensions. Biometrika. 2012;99(1):29–42.
  9. Blagus R, Lusa L. Improved shrunken centroid classifiers for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013;14(64):1–13.
  10. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. JASA. 2001;96(456):1348–1360.
  11. Zou H. The adaptive lasso and its oracle properties. JASA. 2006;101(476):1418–1429.
  12. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38(2):894–942.
  13. Witten DM, Tibshirani R. A framework for feature selection in clustering. JASA. 2010;105:713–726. pmid:20811510
  14. Witten DM. Classification and clustering of sequencing data using a Poisson model. Ann Appl Stat. 2011;5:2493–2518.
  15. Antoniadis A, Fan J. Regularization of wavelet approximations. JASA. 2001;96:939–967.
  16. Donoho DL, Johnstone JM. Ideal spatial adaptation by wavelet shrinkage. Biometrika. 1994;81(3):425–455.
  17. Breiman L. Better subset regression using the nonnegative garrote. Technometrics. 1995;37(4):373–384.
  18. Gao HY, Bruce AG. Waveshrink with firm shrinkage. Stat Sinica. 1997;7:855–874.
  19. Dudoit S, Fridlyand J, Speed T. Comparison of discrimination methods for the classification of tumors using gene expression data. JASA. 2002;97(457):77–87.
  20. Gravier E, Pierron G, Vincent-Salomon A, Gruel N, Raynal V, Savignoni A, et al. A prognostic DNA signature for T1T2 node-negative breast cancer patients. Genes Chromosomes Cancer. 2010;49:1125–1134. pmid:20842727
  21. Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. 2002;415(6870):436–442. pmid:11807556
  22. Yeoh E-J, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell. 2002;1:133–143. pmid:12086872
  23. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, et al. Multiclass cancer diagnosis using tumor gene expression signatures. PNAS. 2001;98(26):15149–15154. pmid:11742071
  24. Golub T, Slonim D, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–537. pmid:10521349
  25. Lee J, Lee J, Park M, Song S. An extensive comparison of recent classification tools applied to microarray data. Comput Stat Data Anal. 2005;48(4):869–885.