Published in: Advances in Data Analysis and Classification 3/2020

Open Access 15-11-2019 | Regular Article

Sparse classification with paired covariates

Authors: Armin Rauschenberger, Iuliana Ciocănea-Teodorescu, Marianne A. Jonker, Renée X. Menezes, Mark A. van de Wiel


Abstract

This paper introduces the paired lasso: a generalisation of the lasso for paired covariate settings. Our aim is to predict a single response from two high-dimensional covariate sets. We assume a one-to-one correspondence between the covariate sets, with each covariate in one set forming a pair with a covariate in the other set. Paired covariates arise, for example, when two transformations of the same data are available. It is often unknown which of the two covariate sets leads to better predictions, or whether the two covariate sets complement each other. The paired lasso addresses this problem by weighting the covariates to improve the selection from the covariate sets and the covariate pairs. It thereby combines information from both covariate sets and accounts for the paired structure. We tested the paired lasso on more than 2000 classification problems with experimental genomics data, and found that for estimating sparse but predictive models, the paired lasso outperforms the standard and the adaptive lasso. The R package palasso is available from cran.
Notes

Electronic supplementary material

The online version of this article (https://doi.org/10.1007/s11634-019-00375-6) contains supplementary material, which is available to authorized users.


1 Background

Lasso regression has become a popular method for variable selection and prediction. In particular, it extends generalised linear models to settings with more covariates than samples. The lasso shrinks the coefficients towards zero, setting some of them exactly equal to zero. Compared to the standard lasso, the adaptive lasso shrinks large coefficients less. In high-dimensional spaces, most coefficients are set to zero, since the number of non-zero coefficients is bounded by the sample size (Zou and Hastie 2005). It is also possible to impose a tighter limit on the number of non-zero coefficients, and to estimate the coefficients under this sparsity constraint. By including fewer covariates, the resulting model may be less predictive but more practical and interpretable. Given an efficient algorithm that produces the regularisation path, we can extract models of different sizes without increasing the computational cost.
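To make this concrete, here is a minimal sketch in R with simulated data, using the glmnet package (Friedman et al. 2010): a single call produces the whole regularisation path, from which models of different sizes can be extracted without refitting. The data, parameter values and the limit of 10 covariates are illustrative.

```r
# Minimal sketch: one lasso path, models of different sizes (simulated data).
library(glmnet)

set.seed(1)
n <- 100; p <- 1000
x <- matrix(rnorm(n * p), nrow = n)            # more covariates than samples
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))     # binary response driven by two covariates

fit <- glmnet(x, y, family = "binomial")       # whole regularisation path in one call

# fit$df holds the number of non-zero coefficients at each value of lambda:
# take the smallest lambda whose model contains at most 10 covariates.
s10 <- min(fit$lambda[fit$df <= 10])
beta10 <- as.vector(coef(fit, s = s10))[-1]    # coefficients without the intercept
sum(beta10 != 0)                               # model size under the sparsity constraint
```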
Paired covariates arise in many applications. Possible origins include two measurements of the same attributes, and two transformations of the same measurements. The covariates are then in two sets, with each covariate in one set forming a pair with a covariate in the other set. These covariate sets may be strongly correlated. Naively, we could either exclude one of the two sets or ignore the paired structure. However, we want to include both sets, and account for the paired structure. Such a compromise potentially improves predictions.
Our motivating example is to predict a binary response from microrna isoform (isomir) expression quantification data. Micrornas help to regulate gene expression and are dysregulated in cancer. Typically, most raw counts from such sequencing experiments equal zero. Different transformations of rna sequencing data lead to different predictive abilities (Zwiener et al. 2014), and knowledge about the presence or absence of an isomir might be more predictive than its actual expression level (Telonis et al. 2017). We hypothesise that combining two transformations of isomir data, namely a count and a binary representation, improves predictions. We also analysed other molecular profiles to show the generality of our approach.
The paired lasso, like the group lasso (Yuan and Lin 2006) and the fused lasso (Tibshirani et al. 2005), is an extension of the lasso for a specific covariate structure. If the covariates are split into groups, we could use the group lasso to select groups of covariates. If the covariates have a meaningful order, we could use the fused lasso to estimate similar coefficients for close covariates. And if there are paired covariates, we recommend the paired lasso to weight among and within the covariate pairs.
Our aim is to create a sparse model for paired covariates. The paired lasso exploits not only both covariate sets but also the structure between them. We demonstrate that it outperforms the standard and the adaptive lasso in a number of settings, while also showing its limitations.
In the following, we introduce paired covariate settings and the paired lasso (Sect. 2), classify cancer types based on two transformations of the same molecular data (Sect. 3), discuss sparsity constraints and potential applications to other paired settings (Sect. 4), and predict survival from gene expression in tumour and normal tissue (see appendix).

2 Method

2.1 Setting

Data are available for n samples, one response and twice p covariates. We allow for continuous, discrete, binary and survival responses. We assume all covariates are standardised, and the setting is high-dimensional (\({p \gg n}\)). Let the \({n \times 1}\) vector \({\varvec{y}}\) represent the response, the \({n \times p}\) matrix \({\varvec{X}}\) the first covariate set, and the \({n \times p}\) matrix \({\varvec{Z}}\) the second covariate set:
\[
\varvec{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \qquad
\varvec{X} = \begin{pmatrix} x_{11} &{} \cdots &{} x_{1p} \\ \vdots &{} \ddots &{} \vdots \\ x_{n1} &{} \cdots &{} x_{np} \end{pmatrix}, \qquad
\varvec{Z} = \begin{pmatrix} z_{11} &{} \cdots &{} z_{1p} \\ \vdots &{} \ddots &{} \vdots \\ z_{n1} &{} \cdots &{} z_{np} \end{pmatrix}.
\]
The one-to-one correspondence between \({\varvec{X}}\) and \({\varvec{Z}}\) gives rise to paired covariates. In practice, the two covariate sets may represent different transformations of the same data. For each j in \(\{1,\ldots ,p\}\), the \({n \times 1}\) covariate vectors \({\varvec{x}_j}\) and \({\varvec{z}_j}\) represent one covariate pair.
We relate the response to the covariates through a generalised linear model. The linear predictor for any sample i in \(\{1,\ldots ,n\}\) equals
\[
\eta _i = \alpha + \sum _{j=1}^{p} \beta _j \, x_{ij} + \sum _{j=1}^{p} \gamma _j \, z_{ij},
\]
where \(\alpha \) is the unknown intercept, and \({\varvec{\beta }} = (\beta _1,\ldots ,\beta _p)^\top \) and \({\varvec{\gamma }} = (\gamma _1,\ldots ,\gamma _p)^\top \) are the unknown regression coefficients. We want to estimate a model with a limited number of non-zero coefficients (e.g. at most 10). Our ambition is to select the most predictive model given such a sparsity constraint. Although additional covariates could improve predictions, many applications require small model sizes.
Such models can be estimated by penalised maximum likelihood, i.e. by finding
\[
(\hat{\alpha }, \hat{\varvec{\beta }}, \hat{\varvec{\gamma }}) = \mathop {\text {arg max}}\limits _{\alpha , \varvec{\beta }, \varvec{\gamma }} \Bigl \{ \log \mathcal {L}(\varvec{y}; \alpha , \varvec{\beta }, \varvec{\gamma }) - \rho (\lambda ; \varvec{\beta }, \varvec{\gamma }) \Bigr \},
\]
where \(\mathcal {L}(\varvec{y}; \alpha , \varvec{\beta }, \varvec{\gamma })\) is the likelihood, which depends on the regression model (e.g. linear, logistic), and \(\rho (\lambda ; \varvec{\beta }, \varvec{\gamma })\) is a penalty function, which we denote by \(\rho (\lambda )\) in the remainder. Unlike ridge regularisation, lasso regularisation implies variable selection. The standard lasso (Tibshirani 1996) and the adaptive lasso (Zou 2006) have the penalty terms
\[
\rho (\lambda ) = \lambda \sum _{j=1}^{p} \bigl ( |\beta _j| + |\gamma _j| \bigr ) \qquad \text {and} \qquad \rho (\lambda ) = \lambda \sum _{j=1}^{p} \left( \frac{|\beta _j|}{\hat{\beta }_j} + \frac{|\gamma _j|}{\hat{\gamma }_j} \right),
\]
respectively, where the parameter \(\lambda \) and all initial estimates \(\hat{\beta }_j\) and \(\hat{\gamma }_j\) are non-negative. The regularisation parameter \(\lambda \) makes a compromise between the unpenalised model (\({\lambda =0}\)) and the intercept-only model (\({\lambda \rightarrow \infty }\)). Increasing \(\lambda \) decreases the number of non-zero coefficients. The purpose of the adaptive lasso is consistent variable selection and optimal coefficient estimation (Zou 2006). It requires the initial estimates \(\hat{\beta }_j\) and \(\hat{\gamma }_j\) (see below) for weighting the covariates. In high-dimensional settings, the adaptive lasso can have a similar predictive performance to the standard lasso while including fewer covariates (Huang et al. 2008). This makes the adaptive lasso promising for estimating sparse models.
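As an illustration of this two-step idea, the following R sketch fits an adaptive lasso with glmnet, using absolute ridge coefficients as initial estimates (one of the suggested options mentioned above; the initial estimator actually used for the paired lasso is described in Sect. 2.3). Data and settings are simulated and illustrative.

```r
# Minimal sketch of the adaptive lasso: initial estimates first, then a lasso
# with covariate-specific penalty factors (inverse initial estimates).
library(glmnet)

set.seed(1)
n <- 100; p <- 500
x <- scale(matrix(rnorm(n * p), nrow = n))                  # standardised covariates
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

# Step 1: initial estimates, here absolute ridge coefficients (alpha = 0).
ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0)
init  <- abs(as.vector(coef(ridge, s = "lambda.min"))[-1])

# Step 2: adaptive lasso, penalising covariate j by 1 / (initial estimate j).
adaptive <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                      penalty.factor = 1 / (init + 1e-8))   # avoid division by zero
sum(as.vector(coef(adaptive, s = "lambda.min"))[-1] != 0)   # selected covariates
```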

2.2 Paired lasso

For the standard and the adaptive lasso, we have to decide whether the model should exploit \({\varvec{X}}\), \({\varvec{Z}}\), or both. If we included only one covariate set, we would lose the information in the other covariate set. If we included both covariate sets, we would double the dimensionality and still ignore the paired structure. In contrast, the paired lasso exploits both covariate sets, and accounts for the paired structure.
We achieve this by choosing among four different weighting schemes: (1) within covariate set \({\varvec{X}}\), (2) within covariate set \({\varvec{Z}}\), (3) among all covariates, or (4) among and within covariate pairs. The tuning parameter \(\omega \) determines the weighting scheme. Each \(\omega \) in \(\{1,2,3,4\}\) leads to different weights \(u_j\) and \(v_j\) for covariates \({\varvec{x}_j}\) and \({\varvec{z}_j}\), for any pair j:
\[
(u_j, v_j) =
\begin{cases}
(\tilde{r}_{x,j},\; 0) & \text {if } \omega = 1 \text { (within } {\varvec{X}}\text {)},\\
(0,\; \tilde{r}_{z,j}) & \text {if } \omega = 2 \text { (within } {\varvec{Z}}\text {)},\\
(\tilde{r}_{x,j},\; \tilde{r}_{z,j}) & \text {if } \omega = 3 \text { (among all covariates)},\\
\text {pairwise-adaptive weights} & \text {if } \omega = 4 \text { (among and within covariate pairs)},
\end{cases}
\]
where \(\tilde{r}_{x,j}\) and \(\tilde{r}_{z,j}\) are some initial estimates (see below). Figure 1 illustrates the four weighting schemes, by showing the sets of weights emanating from some initial estimates. The first three schemes are fallbacks to the adaptive lasso based on \({\varvec{X}}\) (\(\omega =1\)), \({\varvec{Z}}\) (\(\omega =2\)), or both (\(\omega =3\)). The pairwise-adaptive scheme (\(\omega =4\)) is novel: it weights among and within covariate pairs. It depends on the data which weighting scheme leads to the most predictive model.
Leaving the weighting scheme \(\omega \) free, we weight the covariates in the penalty term
\[
\rho (\lambda ) = \lambda \sum _{j=1}^{p} \left( \frac{|\beta _j|}{u_j} + \frac{|\gamma _j|}{v_j} \right),
\]
where \(\lambda \ge 0\) and \(\omega \in \{1,2,3,4\}\). All weights \(u_j\) and \(v_j\) are in the unit interval. The inverse weights serve as penalty factors. Covariate \({\varvec{x}_j}\) has the penalty factor \(1/u_j\), and covariate \({\varvec{z}_j}\) has the penalty factor \(1/v_j\). By receiving infinite penalty factors, covariates with zero weight are automatically excluded. While methods like GRridge (van de Wiel et al. 2016) and ipflasso (Boulesteix et al. 2017) adapt penalisation to covariate sets, our penalty factors are covariate-specific. The penalty increases with both coefficients \(\beta _j\) and \(\gamma _j\), but more with the one that has a larger penalty factor. We can thereby penalise the covariates asymmetrically: less if presumably important, and more if presumably unimportant.
Exploiting the efficient procedure for penalised maximum likelihood estimation from glmnet (Friedman et al. 2010), we use internal cross-validation to select \(\lambda \) from 100 candidates, and to select \(\omega \) from four candidates. To avoid overfitting, we estimate the weights in each internal cross-validation iteration. The tuning parameter \(\omega \) governs the type of weighting, and the tuning parameter \(\lambda \) determines the amount of regularisation. Despite the covariate-specific penalty factors, the paired lasso is only four times as computationally expensive as the standard lasso. Unlike cross-validating the weighting scheme, cross-validating all weights \(u_j\) and \(v_j\) would be computationally infeasible and likely prone to overfitting.
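The following R sketch mimics this mechanism with glmnet directly: marginal absolute correlations serve as weights, each weighting scheme defines one set of penalty factors, and the scheme with the lowest cross-validated deviance is retained. The pairwise-adaptive weights in the sketch are a simple stand-in, not the formula implemented by the authors, and the data are simulated; the palasso package performs the actual procedure, including the correlation shrinkage of Sect. 2.3.

```r
# Simplified sketch of the paired-lasso mechanism (illustrative only).
library(glmnet)

set.seed(1)
n <- 100; p <- 200
x <- scale(matrix(rnorm(n * p), nrow = n))          # first covariate set
z <- scale(x + matrix(rnorm(n * p), nrow = n))      # paired, strongly correlated set
y <- rbinom(n, 1, plogis(x[, 1] + z[, 2]))

rx <- abs(cor(y, x))[1, ]                           # marginal absolute correlations
rz <- abs(cor(y, z))[1, ]

schemes <- list(
  within_x = c(rx, rep(0, p)),                      # scheme 1: within x only
  within_z = c(rep(0, p), rz),                      # scheme 2: within z only
  among    = c(rx, rz),                             # scheme 3: among all covariates
  paired   = c(pmax(rx, rz) * rx / (rx + rz),       # scheme 4: stand-in for the
               pmax(rx, rz) * rz / (rx + rz))       #   pairwise-adaptive weights
)

xz <- cbind(x, z)
foldid <- sample(rep(1:10, length.out = n))         # same folds for all schemes
fits <- lapply(schemes, function(w)
  cv.glmnet(xz, y, family = "binomial", foldid = foldid,
            penalty.factor = 1 / pmax(w, 1e-8)))    # zero weight = huge penalty
cvm <- sapply(fits, function(fit) min(fit$cvm))     # cross-validated deviance
names(which.min(cvm))                               # selected weighting scheme
```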

2.3 Initial estimators

Inspired by the adaptive lasso (Zou 2006), we estimate the effects of the covariates on the response in two steps, obtaining the initial and the final estimates from the same data. Suggested initial estimates for the adaptive lasso in high-dimensional settings include absolute coefficients from ridge (Zou 2006), lasso (Bühlmann and van de Geer 2011) and simple (Huang et al. 2008) regression. Marginal estimates have several advantages over conditional estimates. First, estimating conditional effects is hard in high-dimensional settings with strongly correlated covariates. Conditional estimation strongly depends on the type of regularisation. Second, estimating marginal effects is computationally more efficient than estimating conditional effects. Third, we can easily improve the quality of the marginal estimates by empirical Bayes, because standard errors are available (Dey and Stephens 2018).
We can obtain marginal estimates from simple correlation or simple regression. Even if the covariates are standardised, logistic regression on binary covariates sometimes leads to extreme coefficients. Instead of adjusting regression coefficients for different standard errors, we use correlation coefficients. Their absolute values are between zero and one, and thus interpretable as weights. Fan and Lv (2008) also use correlation for screening covariates. For linear, logistic and Poisson regression, we calculate the absolute Pearson correlation coefficients between the response and the standardised covariates:
\[
\hat{r}_{x,j} = \bigl | \mathrm {cor}(\varvec{y}, \varvec{x}_j) \bigr |, \qquad \hat{r}_{z,j} = \bigl | \mathrm {cor}(\varvec{y}, \varvec{z}_j) \bigr |, \qquad j \in \{1,\ldots ,p\}.
\]
For Cox regression, we calculate the rescaled concordance indices between the right-censored survival time and the standardised covariates (\(C \rightarrow | 2 C - 1 |\)), which are interpretable as absolute correlation coefficients. To stabilise noisy estimates, we shrink \(\hat{r}_{x,j}\) and \(\hat{r}_{z,j}\) separately towards zero, using the adaptive correlation shrinkage from CorShrink (Dey and Stephens 2018). This procedure Fisher-transforms the correlation coefficients to standard scores (\(\rho \rightarrow \text {artanh}(\rho )\)), uses an asymptotic normal approximation, performs the shrinkage by empirical Bayes, and transforms the shrunken standard scores back (\(z \rightarrow \text {tanh}(z)\)). Empirical Bayes implies that the data determine the amount of shrinkage. We denote the shrunken estimates by \(\tilde{r}_{x,j}\) and \(\tilde{r}_{z,j}\).
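The following R sketch mimics this shrinkage step under a strong simplification: a single normal prior on the Fisher-transformed correlations, with the prior variance estimated by moments. The CorShrink package (Dey and Stephens 2018) uses a more flexible adaptive prior; this sketch only illustrates the overall mechanism of transforming, shrinking and back-transforming.

```r
# Simplified stand-in for the adaptive correlation shrinkage (illustrative only).
shrink_cor <- function(r, n) {
  z    <- atanh(r)                         # Fisher transform (artanh)
  se2  <- 1 / (n - 3)                      # asymptotic variance of the standard scores
  tau2 <- max(mean(z^2) - se2, 0)          # moment estimate of the prior variance
  z_shrunk <- z * tau2 / (tau2 + se2)      # posterior mean under a N(0, tau2) prior
  tanh(z_shrunk)                           # back-transform to the correlation scale
}

set.seed(1)
r_raw   <- runif(200, 0, 0.5)              # illustrative absolute correlations
r_tilde <- shrink_cor(r_raw, n = 50)       # estimates shrunken towards zero
summary(r_raw - r_tilde)                   # amount of shrinkage
```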
Although marginal and conditional effects of covariates may differ strongly, we conjecture that covariates with strong marginal effects tend to be conditionally more important than those with weak marginal effects. Using the same hypothesis, Fan and Lv (2008) showed that reducing dimensionality by screening out covariates with weak marginal effects can improve model selection. For each combination of two covariates, we conjecture that the one with the greater absolute correlation coefficient is conditionally more important than the other. Instead of comparing all coefficients at once, we compare them within the first covariate set, within the second covariate set, among all covariates, and simultaneously among and within the covariate pairs. These comparisons correspond to the four weighting schemes.

3 Results

We tested the paired lasso in 2048 binary classification problems. In each classification problem, we used one molecular profile to classify samples into two cancer types. Our paired covariates consist of two representations of the same molecular profile. We compared the paired lasso with the standard and the adaptive lasso.

3.1 Classification problems

Molecular tumour markers may improve cancer diagnosis, cancer staging and cancer prognosis. One may analyse blood or urine samples to detect cancer, classify cancer subtypes, predict disease progression, or predict treatment response. Because too few liquid biopsy data are available for reliably evaluating prediction models, we analyse tissue samples to classify cancer types, as a proof of concept. This is less clinically relevant, but allows a comprehensive comparison of models. The challenge is to select a small subset of features with high predictive power.
The Cancer Genome Atlas (tcga) provides genomic data for more than 11,000 patients. From the harmonised data, we retrieved gene expression quantification, microrna isoform (isomir) expression quantification, microrna (mirna) expression quantification, and “masked” copy number segments with TCGAbiolinks (Colaprico et al. 2016). Data are available for 19,602 protein-coding genes, 197,595 isomirs, and 1881 mirnas. The transcriptome profiling data are counts, and the copy number variation (cnv) data are segment mean values. We extracted the segment mean values at 10,000 evenly spaced chromosomal locations. The samples come from different types of material. We included primary solid tumour samples for all cancer types available, except in the case of leukaemia, where we included peripheral blood samples. For patients with replicate samples, we randomly chose one sample.
Analysing one molecular profile at a time, we classified the samples into cancer types. Depending on the molecular profile, the samples come from 32 or 33 cancer types, leading to \(\left( {\begin{array}{c}32\\ 2\end{array}}\right) = 496\) or \(\left( {\begin{array}{c}33\\ 2\end{array}}\right) = 528\) binary classification problems, respectively. In each classification problem, we classified samples from two cancer types, ignoring samples from other cancer types (Fig. 2).
We used double cross-validation with 10 internal and 5 external folds to tune the parameters and to estimate the prediction accuracy, respectively. In the outer cross-validation loop, we repeatedly \((5\times )\) used four external folds for training and validation \((80\%)\), and one external fold for testing \((20\%)\). In the inner cross-validation loop, we repeatedly \((10\times )\) split the samples for training and validation into nine inner folds for training \((72\%)\) and one inner fold for validation \((8\%)\). Training samples serve for estimating the coefficients \(\varvec{\beta }\) and \(\varvec{\gamma }\), validation samples for tuning the parameters \(\lambda \) and \(\omega \), and testing samples for measuring the predictive performance. As a loss function for logistic regression, we chose the deviance \(-2 \sum _{i=1}^n \{ y_i \log {(p_i)} + {(1-y_i)} {\log (1-p_i)} \}\), where \(y_i\) and \(p_i\) are the observed response and the predicted probability for individual i, respectively. Although we minimised the deviance to tune the parameters, we also calculated the area under the receiver operating characteristic curve (auc) and the misclassification rate to estimate the prediction accuracy. Since indirect maximisation might lead to suboptimal aucs (Cortes and Mohri 2004), we prefer the deviance as the primary evaluation metric.
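For completeness, a direct transcription of this loss function in R (the observed classes and predicted probabilities below are made up):

```r
# Deviance loss for logistic regression, as defined above.
deviance_loss <- function(y, p, eps = 1e-12) {
  p <- pmin(pmax(p, eps), 1 - eps)                  # guard against log(0)
  -2 * sum(y * log(p) + (1 - y) * log(1 - p))
}

deviance_loss(y = c(0, 1, 1, 0, 1), p = c(0.1, 0.8, 0.6, 0.3, 0.9))
```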

3.2 Paired covariates

Transcriptome profiling data require some preprocessing. We preprocessed the expression counts for each cancer–cancer combination separately, using the same procedure for genes, isomirs and mirnas. The total raw count for an individual is its library size, and the total raw count for a transcript is its abundance. We used the trimmed mean normalisation method from edgeR (Robinson and Oshlack 2010) to adjust for different library sizes, and filtered out all transcripts with an abundance smaller than the sample size. This filtering removes non-expressed transcripts and lets the dimensionality increase with the sample size. Furthermore, we Anscombe-transformed the normalised expression counts (\(x \rightarrow {2\sqrt{x + 3/8}}\)).
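A sketch of this preprocessing in R, assuming a raw count matrix counts with transcripts in rows and samples in columns; the helper below and the order of its steps are illustrative, not the authors' exact code.

```r
# Illustrative preprocessing: filter, normalise (TMM from edgeR), transform.
library(edgeR)

preprocess <- function(counts) {
  n <- ncol(counts)                                 # sample size
  keep <- rowSums(counts) >= n                      # drop low-abundance transcripts
  counts <- counts[keep, , drop = FALSE]

  dge <- DGEList(counts = counts)
  dge <- calcNormFactors(dge, method = "TMM")       # trimmed mean normalisation
  eff <- dge$samples$lib.size * dge$samples$norm.factors
  norm <- t(t(counts) / eff) * mean(eff)            # adjust for library sizes

  2 * sqrt(norm + 3/8)                              # Anscombe transformation
}
```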
Then we converted each molecular profile to paired covariates. The covariate matrix \({\varvec{X}}\) contains the “original” data, and the covariate matrix \({\varvec{Z}}\) contains a compressed version, obtained in the following way:
  • Gene expression: Shmulevich and Zhang (2002) binarise microarray gene expression data by separating low and high expression values with an edge detection algorithm. For each gene j, we sorted the normalised counts in ascending order (\(x_{(1)j} \le \cdots \le x_{(n)j}\)), and calculated the differences between consecutive values (\(d_{ij} = x_{(i+1)j} - x_{(i)j}\)). Maximising \({H(i/n)} d_{ij}\) with respect to i, where \(H(\cdot )\) is the binary entropy function, we obtained a cutoff. The binary covariate \(z_{ij}\) indicates whether the continuous covariate \(x_{ij}\) is above this cutoff (see the sketch below).
  • Isomir and mirna expression: Telonis et al. (2017) binarise isomir data by labelling the bottom \({80\%}\) and top \({20\%}\) most expressed isomirs of a sample as “absent” or “present”, respectively. Because we analysed samples from only two cancer types at a time, and filtered out low-abundance transcripts, this binarisation procedure would be unstable. Instead, we let the binary covariate matrix \({\varvec{Z}}\) indicate non-zero expression counts.
  • Copy number variation: If c is a copy number, the corresponding segment mean value equals \({\log _2 (c/2)}\). Negative and positive values indicate deletions or amplifications, respectively. Without introducing lower and upper bounds, we only assigned values equalling zero to the diploid category. Accordingly, the ternary covariate matrix \({\varvec{Z}}\) indicates the signs of the segment mean values.
Thus, we obtained two transformations of the same data: the continuous \({\varvec{X}}\) and the binary or ternary \({\varvec{Z}}\). Attribute j is represented by both \({\varvec{x}_j}\) and \({\varvec{z}_j}\). Preparing for penalised regression, we transformed all covariates to mean zero and unit variance.
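A small R sketch of the binarisation of gene expression from the list above (an illustrative implementation of the edge detection described there), together with the zero-indicator used for isomirs and mirnas:

```r
# Entropy-based binarisation for one gene with normalised counts x (illustrative).
binarise_gene <- function(x) {
  n  <- length(x)
  xs <- sort(x)                                 # ascending normalised counts
  d  <- diff(xs)                                # differences between consecutive values
  q  <- seq_len(n - 1) / n                      # proportions strictly between 0 and 1
  H  <- -q * log2(q) - (1 - q) * log2(1 - q)    # binary entropy function
  cutoff <- xs[which.max(H * d)]                # maximise the entropy-weighted jump
  as.integer(x > cutoff)                        # 1 if above the cutoff, 0 otherwise
}

# Zero-indicator used for isomir and mirna expression.
binarise_count <- function(x) as.integer(x != 0)
```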

3.3 Predictive performance

Natural competitors for the paired lasso are the standard and the adaptive lasso. We compared the paired lasso, exploiting both \({\varvec{X}}\) and \({\varvec{Z}}\), with six competing models: the standard and the adaptive lasso exploiting either \({\varvec{X}}\), \({\varvec{Z}}\), or both. We strive for very sparse models, as often desired in clinical practice. For now, each model may include up to 10 covariates.
We compared the predictive performance of the paired lasso and the competing models based on the cross-validated deviance. We speak of an improvement if the paired lasso decreases the deviance, and of a deterioration if the paired lasso increases the deviance. Compared to each competing model, the paired lasso leads to more improvements than deteriorations, for all molecular profiles (Fig. 3). According to the median deviance, the best competing model is an adaptive lasso based on a single covariate set: one set for genes and isomirs, and the other set for mirnas and cnvs. But the paired lasso is better in \({57\%}\), \({69\%}\), \({61\%}\) and \({54\%}\) of the cases, respectively. We also calculated the difference in deviance between the paired lasso and the competing models. The improvements tend to exceed the deteriorations (Fig. 3).
In addition to the deviance, we also examined the more interpretable auc and misclassification rate. For example, cnvs reliably separate testicular cancer (tgct) and ovarian cancer (ov) from most cancer types, but not ovarian from uterine cancer (ucec and ucs) (Fig. 4). Despite the sparsity constraint, the paired lasso achieves a median auc above 0.99 for genes, isomirs and mirnas, and a median auc of 0.94 for cnvs. The misclassification rates are \({0.4\%}\), \({0.6\%}\), \({0.4\%}\) and \({10.0\%}\), respectively. The reason for the extremely good separation is that the samples come not only from different cancer types, but also from different tissues. Comparisons are most meaningful for cnvs, for which the paired lasso indeed tends towards greater aucs and smaller misclassification rates than the competing models (Fig. 5).
The next step is to test whether the paired lasso is significantly better than the competing models. For each molecular profile and each competing model, we calculated the difference in deviance between the paired lasso and the competing model. A setting with k cancer types leads to \(\left( {\begin{array}{c}k\\ 2\end{array}}\right) \) differences in deviance. However, these values are mutually dependent because of the overlapping cancer types. We therefore cannot directly test whether they are significantly different from zero. Instead, we accounted for their dependencies.
We split the dependent values into groups of independent values. To increase power, we minimised the number of groups and maximised the group sizes. Given 32 cancer types, we split the 496 dependent values into 31 groups of 16 independent values (Fig. 6). Given 33 cancer types, we split the 528 dependent values into 33 groups of 16 independent values. After conducting the one-sided Wilcoxon signed-rank test within each group, we combined the 31 or 33 dependent p values with the Simes combination test (Westfall 2005). This combination leads to one p value for each molecular profile and each competing model (Table 1). At the \({5\%}\) level, 22 out of 24 combined p values are significant. The two insignificant improvements occur for gene expression and for cnv, in both cases against an adaptive lasso based on a single covariate set (Table 1). We conclude that for these data the paired lasso is significantly better than the competing models.
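A short R sketch of this testing procedure, where diffs stands for a hypothetical list of numeric vectors (one group of independent differences in deviance per list element, negative values meaning that the paired lasso has the lower deviance):

```r
# One-sided Wilcoxon signed-rank test per group, then the Simes combination test.
simes <- function(p) {
  k <- length(p)
  min(sort(p) * k / seq_len(k))                     # Simes combined p value
}

combine_groups <- function(diffs) {
  p <- sapply(diffs, function(d)
    wilcox.test(d, alternative = "less")$p.value)   # H1: paired lasso decreases deviance
  simes(p)
}

# Toy example with three groups of sixteen independent differences.
set.seed(1)
combine_groups(list(rnorm(16, -1), rnorm(16, -0.5), rnorm(16, -0.8)))
```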
Table 1
Combined p values

           Standard                          Adaptive
           X         Z         X & Z         X         Z         X & Z
gene       0.0003    0.0035    0.0034        0.0024    (>0.05)   0.0242
isomir     0.0003    0.0011    0.0010        0.0021    0.0091    0.0147
mirna      0.0003    0.0003    0.0003        0.0305    0.0010    0.0066
cnv        0.0003    0.0003    0.0003        (>0.05)   0.0011    0.0096

Each molecular profile (row) and each competing model (column) leads to one combined p value, indicating whether the paired lasso improves predictions. Among the combined p values, 22 are significant and 2 are insignificant (in brackets) at the \({5\%}\) level

3.4 Weighting schemes

After cross-validation, we trained the paired lasso with the full data sets. The paired lasso exploits all four weighting schemes, often including both covariate sets (\({46\%}\) for genes, \({49\%}\) for isomirs, \({55\%}\) for mirnas, and \({54\%}\) for cnvs) (Table 2). When including both covariate sets, it tends to weight among all covariates for genes (\(\omega =3\)), but among and within covariate pairs for isomirs, mirnas and cnvs (\(\omega =4\)). When including only one covariate set, it tends to favour one set for genes, but the other set for isomirs, mirnas and cnvs. On average, the total weight is split unevenly between the two covariate sets (\({63\%}\)/\({37\%}\) for genes, \({64\%}\)/\({36\%}\) for isomirs, \({79\%}\)/\({21\%}\) for mirnas, and \({60\%}\)/\({40\%}\) for cnvs), and so are the non-zero coefficients (\({36\%}\)/\({64\%}\) for genes, \({58\%}\)/\({42\%}\) for isomirs, \({82\%}\)/\({18\%}\) for mirnas, and \({71\%}\)/\({29\%}\) for cnvs). Often, the paired lasso does not merely select the most informative covariate set, but combines information from both covariate sets.
Table 2
Selected weighting schemes

           scheme 1    scheme 2    scheme 3    scheme 4
gene       0.21        0.33        0.32        0.14
isomir     0.26        0.25        0.21        0.28
mirna      0.36        0.10        0.26        0.29
cnv        0.31        0.15        0.17        0.37

Depending on the molecular profile (row), the paired lasso favours different weighting schemes (columns; schemes numbered as in Sect. 2.2). The entries are row proportions
Subject to at most 10 non-zero coefficients, the paired lasso has a better predictive performance than the standard and the adaptive lasso based on \({\varvec{X}}\) and/or \({\varvec{Z}}\). We repeated cross-validation with tighter and looser sparsity constraints. As the maximum number of non-zero coefficients increases, the differences between the paired lasso and the competing models decrease (Fig. 7). Relaxing the sparsity constraint allows the competing models to include more or all relevant predictors. This improves classifications, leaves less room for further improvements, and makes the pairwise-adaptive weighting less important. Nevertheless, without a sparsity constraint, the paired lasso leads to much sparser models than the standard lasso (Table 3).
The elastic net (Zou and Hastie 2005) is an alternative method for handling the strong correlation between the two covariate sets. Without a sparsity constraint, the elastic net might render much larger models than the paired lasso, and thereby lead to a better predictive performance. We fix the elastic net mixing parameter at \(\alpha =0.95\) (close to the lasso) to obtain sparse and stable solutions (Friedman et al. 2010). Compared to the paired lasso, the elastic net includes more non-zero coefficients (Table 3), and thereby decreases the logistic deviance in \({67\%}\) of the classification problems for genes, \({68\%}\) for isomirs, \({83\%}\) for mirnas, and \({83\%}\) for cnvs. Given the same resolution in the solution path, the elastic net has more and larger jumps in the sequence of non-zero coefficients, because it renders larger models. We therefore doubled the resolution for the elastic net, to approach the sparsity constraints as closely as possible. At the sparsity constraint of 10, the paired lasso leads to a lower logistic deviance in more than \({95\%}\) of the classification problems for genes, isomirs, mirnas and cnvs. This confirms that the elastic net is good for estimating relatively dense models, and the paired lasso is good for estimating sparse models.
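With glmnet, this comparison amounts to changing the mixing parameter and the path resolution; a minimal sketch, reusing the simulated xz and y from the sketch in Sect. 2.2 and assuming that the doubled resolution corresponds to 200 instead of 100 candidate values of \(\lambda \):

```r
# Elastic net close to the lasso, with a doubled path resolution (illustrative).
enet <- cv.glmnet(xz, y, family = "binomial", alpha = 0.95, nlambda = 200)
sum(as.vector(coef(enet, s = "lambda.min"))[-1] != 0)   # number of non-zero coefficients
```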
Table 3
Average numbers of non-zero coefficients

           Standard                Adaptive                Paired    Elastic
           X      Z      X & Z     X      Z      X & Z
gene       31     22     21        20     17     17        18        –
isomir     33     31     28        20     19     18        18        –
mirna      26     38     28        16     21     16        16        –
cnv        83     110    105       51     78     63        61        –

Without a sparsity constraint, the standard lasso includes more covariates than the adaptive and the paired lasso, for each molecular profile (row)

4 Discussion

We developed the paired lasso for estimating sparse models from paired covariates. It handles situations where it is unclear whether one covariate set is more predictive than the other covariate set, or whether both covariate sets together are more predictive than one covariate set alone.
Under a sparsity constraint, the paired lasso can have a better predictive performance than the standard and the adaptive lasso based on \({\varvec{X}}\) and/or \({\varvec{Z}}\). In our comparisons, the standard and the adaptive lasso each have three chances to beat the paired lasso: exploiting \({\varvec{X}}\), \({\varvec{Z}}\), or both. Nevertheless, the paired lasso, automatically choosing from \({\varvec{X}}\) and \({\varvec{Z}}\), improves on the best standard and the best adaptive lasso.
This improvement stems from introducing a pairwise-adaptive weighting scheme and choosing among multiple weighting schemes. A super learner (van der Laan et al. 2007) would combine predictions from multiple weighting schemes, improving predictions at the cost of interpretability. In contrast, the paired lasso attempts to select the most predictive combination of covariate sets, and the most predictive covariates.
Sparsity constraints should be employed regardless of whether the underlying effects are sparse or not. Their purpose is to make models as sparse as desired. Even if numerous covariates influence the response, we might still be interested in the top few most influential covariates. For example, a cost-efficient clinical implementation may require a limited number of markers. But if the standard lasso without a sparsity constraint returns a sufficiently sparse model, the sparsity constraint is redundant.
The paired lasso uses the response twice, first for weighting the covariates, and then for estimating their coefficients. This two-step procedure increases the weight of presumably important covariates, and decreases the weight of presumably unimportant covariates. Therefore, without an effective sparsity constraint, the paired lasso tends to sparser models than the standard lasso, and with an effective sparsity constraint, the paired lasso tends to more predictive models than the standard lasso.
Paired covariates arise in many genomic applications:
  • Molecular profiles with meaningful thresholds also include exon expression and dna methylation. Exons can have different types of effects on a clinical response. Some exons are retained for some samples, but spliced out for other samples. Other exons are retained for all samples, but with different expression levels. Both the change from “non-expressed” to “expressed” and the expression level might have an effect. We could match zero-indicators with count covariates to account for both types of effects. Similarly, beyond considering cpg islands as unmethylated or methylated, we could also account for methylation levels.
  • Some molecular profiles lead to categorical variables with three or more levels. Single nucleotide polymorphism (snp) genotype data take the values zero, one and two minor alleles. Depending on the effect of interest, we would normally construct indicators for “one or two minor alleles” to analyse dominant effects, indicators for “two minor alleles” to analyse recessive effects, or quantitative variables to analyse additive effects. Instead, we could include both indicator groups to account for all three types of effects (see the sketch after this list). Similarly, we could represent cnv data as two sets of ternary covariates, the first indicating losses and gains, and the second indicating great losses and great gains.
  • Another source of paired covariates is repeated measures. If the same molecular profile is measured twice under the same conditions, the average might be a good choice. But less so if the same molecular profile is measured under different conditions. Then it might be better to match the repeated measures. An interesting application is to predict survival from gene expression in tumour (\({\varvec{X}}\)) and normal (\({\varvec{Z}}\)) tissue collected from the vicinity of the tumour (Huang et al. 2016). We compared the paired lasso with the standard and the adaptive lasso based on \({\varvec{X}}\) and/or \({\varvec{Z}}\) (see appendix). For at least five out of six cancer types, the paired lasso fails to improve the cross-validated predictive performance. We argue that sparsity might be a wrong assumption for these data, in particular for the survival response, which may be better accommodated by dense predictors like ridge regression (van Wieringen et al. 2009). Indeed, the standard lasso generally selects few or no variables for four cancer types. Moreover, adaptation fails to improve the standard lasso for another cancer type, leaving little room for improvement to the paired lasso, which is essentially a bag of adaptive lasso models. Finally, for one cancer type, the paired lasso is competitive with the adaptive lasso based on tumour tissue, both performing relatively well. The paired lasso has the practical advantage of automatically selecting from the covariate sets.
  • An omnipresent challenge is the integration of multiple molecular profiles (Gade et al. 2011; Bergersen et al. 2011; Aben et al. 2016; Boulesteix et al. 2017; Rodríguez-Girondo et al. 2017). The paired lasso is not directly suitable for analysing multiple molecular profiles simultaneously. However, for two molecular profiles with a one-to-one correspondence, the paired lasso can be used as an integrative model. A well-known example is messenger rna expression and matched dna copy number.
  • Paired main and interaction effects have the same paired structure as paired covariates. Since the paired lasso would treat the two sets of effects as two sets of covariates, it would violate the hierarchy principle. In this context, the group lasso was shown to be beneficial (Ternès et al. 2017). Although the paired lasso might also improve predictions, an adaptation would be required to enforce the hierarchy principle.
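As an illustration of the second bullet above, a small R sketch constructing the two indicator sets from a hypothetical genotype matrix of minor-allele counts:

```r
# Paired coding of SNP genotypes (0, 1 or 2 minor alleles); 'geno' is made up.
set.seed(1)
geno <- matrix(rbinom(100 * 5, size = 2, prob = 0.3), nrow = 100)

dominant  <- (geno >= 1) * 1L      # indicator of at least one minor allele
recessive <- (geno == 2) * 1L      # indicator of two minor alleles

# Column j of 'dominant' and column j of 'recessive' form covariate pair j,
# so both indicator sets can enter the paired lasso together.
```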
In paired covariate settings, there are two types of groups: covariate pairs and covariate sets. From each covariate pair, the paired lasso selects zero, one, or two covariates. Alternatively, the group lasso (Yuan and Lin 2006) would select either zero or two covariates, the exclusive lasso (Campbell and Allen 2017) at least one covariate, and the protolasso (Reid and Tibshirani 2016) at most one covariate. Although these methods were not designed for paired covariates, they might improve interpretability in some applications with paired covariates. However, it would be challenging to account for covariate pairs and covariate sets, because these are overlapping groupings.
We focussed on binary responses, but our approach also works with other univariate responses. Currently, our implementation supports linear, logistic, Poisson and Cox regression. Although it allows for \(L_1\) regularisation (lasso), \(L_2\) regularisation (ridge) and combinations thereof (elastic net), sparsity constraints require an \(L_1\) penalty, and the performance under an \(L_2\) penalty requires further research.

Acknowledgements

This research was funded by the Department of Epidemiology and Biostatistics, Amsterdam umc, vu University Amsterdam.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no potential conflicts of interest.

Reproducibility

The R package palasso contains a vignette for reproducing all results.

Software

The R package palasso runs on any operating system equipped with R-3.5.0 or later. It is available from cran under a free software license: https://CRAN.R-project.org/package=palasso.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix

Electronic supplementary material

Literature
Cortes C, Mohri M (2004) AUC optimization vs. error rate minimization. In: Thrun S, Saul LK, Schölkopf B (eds) Advances in neural information processing systems 16. MIT Press, Cambridge, pp 313–320
Huang J, Ma S, Zhang CH (2008) Adaptive lasso for sparse high-dimensional regression models. Stat Sin 18(4):1603–1618
Rodríguez-Girondo M, Kakourou A, Salo P, Perola M, Mesker WE, Tollenaar RA, Houwing-Duistermaat J, Mertens BJ (2017) On the combination of omics data for prediction of binary outcomes. In: Datta S, Mertens BJ (eds) Statistical analysis of proteomics, metabolomics, and lipidomics data using mass spectrometry. Springer, Cham, pp 259–275. https://doi.org/10.1007/978-3-319-45809-0_14
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58(1):267–288
