
Beyond Bonferroni: new multiple contrast tests for time-to-event data under non-proportional hazards

  • Open Access
  • 01-03-2026
Abstract

This article delves into the challenges of analyzing time-to-event data with multiple groups or factorial designs, particularly under non-proportional hazards. It introduces novel multiple contrast test procedures (MCTPs) that address the limitations of traditional methods like the log-rank test and Bonferroni adjustments. The article presents extensive simulation studies that evaluate the performance of these new methods under various scenarios, including proportional hazards, non-proportional hazards, crossing hazards, and mixed settings. Additionally, it provides a real-world data example using the CoMMpass study on multiple myeloma patients, demonstrating the practical application of these methods. The results highlight the robustness and power of the new approaches, especially in scenarios where the proportional hazards assumption is violated. The article concludes with a discussion on the strengths and limitations of each method, offering valuable insights for researchers and practitioners in the field of survival analysis.

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1007/s10985-025-09676-9.
Deceased: Marc Ditzhaus.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Time-to-event or survival analysis is ubiquitous across medical research, engineering, and social sciences. Trials often involve multiple groups (treatment arms) or factorial designs, creating unique statistical challenges. The primary research question often focuses not merely on whether any arms differ but specifically on identifying which groups show differences. Thus, traditional global test procedures like ANOVA-type methods, which test null hypotheses of equal hazard ratios or cumulative hazard rate functions, are often inadequate (Konietschke et al. 2012; Ditzhaus et al. 2023). Instead, flexible multiple comparison procedures are crucial in modern data analysis. Current approaches typically employ pairwise multiple log-rank tests with adjustments for multiplicity (e.g., Bonferroni correction) (Logan et al. 2005), but these methods can lack efficiency due to restrictive assumptions about the correlation structure of the test statistics (Gao et al. 2008; Gao and Alvo 2008). In recent years, many researchers have developed multiple contrast test procedures (MCTPs) along with simultaneous confidence intervals (SCIs) (usually conducted as maximum tests), which are valid for arbitrary correlations of the test statistics and use the correlation within the multiplicity adjustment for various endpoints (means, proportions, Mann–Whitney effects) (Bretz et al. 2001; Schaarschmidt et al. 2009; Hasler and Hothorn 2008; Konietschke et al. 2013; Blanche et al. 2022). Munko et al. (2024) introduced restricted mean survival time (RMST)-based multiple contrast tests for time-to-event data. Since the RMST should not be employed under crossing hazards (Dormuth et al. 2022, 2023), we aim to close this gap and introduce a powerful and flexible MCTP for analyzing survival data with crossing hazards.
The log-rank test is one of the most prominent test procedures in survival analysis. The method is well known to be optimal when the proportional hazards (PH) assumption is met, but it loses power significantly otherwise (Dormuth et al. 2023). Even though the problem is fairly well known, many investigators of (clinical) trials still ignore the issue and publish findings based on log-rank tests in leading high-quality peer-reviewed journals, even when the assumption is violated (Kristiansen 2012; Trinquart et al. 2016; Dormuth et al. 2023). For the analysis of two independent samples, weighted log-rank tests and their combinations comprise a great alternative to the classical log-rank test and are beneficial in non-proportional hazards models (Andersen et al. 1993; Fleming and Harrington 1991; Brendel et al. 2014; Ditzhaus and Friedrich 2020). Which weight function to choose depends on the alternative of interest, and no general recommendation can be given. Ditzhaus and Friedrich (2020) therefore propose a Wald-type test combining multiple weight functions within a single multivariate test. However, the test does not provide information on which weight function appears most powerful. For the analysis of more than two samples and factorial designs, Ditzhaus et al. (2023) extended these procedures to the Cumulative Aalen Survival Analysis-of-Variance (CASANOVA) method. In principle, these are global ANOVA-based tests (quadratic forms) and can be used to estimate and test main and interaction effects in general factorial designs. Estimating and testing user-specific contrasts, however, is impossible within this framework, limiting its application in statistical practice. To overcome these shortcomings, we propose a novel flexible MCTP. Extensive simulation studies indicate that the test is more powerful under non-proportional hazards and eliminates the need for additional p-value correction. The remainder of the paper is organized as follows.
The second section introduces the main statistical methods employed in the analyses. The third section describes the simulation setup and the corresponding results. The following section applies the methods of interest to a real-world data example. The conclusions are drawn in section five, together with future research questions.

2 Setup

Multiple contrast problems arise in many research questions involving time-to-event endpoints. Applying separate tests without adjusting for multiple testing increases the likelihood of false discoveries and inflated error rates. In the following, we present different well-established statistical methods for an underlying multiple contrast problem with time-to-event endpoints, as well as our newly developed method based on a combination of multi-directional log-rank tests and the concept of maximum tests.

2.1 Statistical model

First, we define the underlying statistical model. We consider a study design involving \(k\ge 2\) groups (treatment arms) of \(n_j\) independent subjects, each with time-to-event \(T_{ji}\) and right-censoring time \(C_{ji}\). The statistical model considered here can be summarized by mutually independent positive random variables \(T_{ji}\sim F_j \quad \text {and} \quad C_{ji}\sim G_{j}, \quad j=1,\ldots ,k;~ i=1,\ldots ,n_j, \) where \(F_j\) and \(G_j\) are continuous distribution functions. Furthermore, let \(X_{ji}=\min (T_{ji},C_{ji})\) denote the observed time and \(\delta _{ji}=I(X_{ji}=T_{ji})\) the censoring status, with \(I(\cdot )\) being the indicator function. The model does not entail any parameters but rather the survival distributions, which can be used to define reasonable treatment effects. The cumulative hazard rate function for group j is defined by
$$\begin{aligned} A_j(t) = \int \limits _0^t \alpha _j(s)\, ds, \quad t\ge 0, \; j=1,\ldots ,k, \end{aligned}$$
(1)
with \(\alpha _j\) the hazard rate of group j.
We further assume non-vanishing relative group sizes, \(n_j/n^*\rightarrow \kappa _j\in (0,1)\) with \(n^*=\sum _{j=1}^{k} n_j\), as \(\min (n_j:j=1,\ldots ,k)\rightarrow \infty \), and we exclude the case of only censored values within one group by assuming that \(0< F_j(t) < 1\) and \(0< G_j(t) < 1\) for all \(j= 1,\ldots ,k\) and some \(t>0\).
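To make the observation scheme concrete, the following Python snippet (the paper's own implementations are in R) simulates right-censored data for one group. The exponential event times and uniform censoring are illustrative assumptions for this sketch, not the distributions of the later simulation study; the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_group(n, event_rate, censor_high):
    """Draw right-censored observations for one group.

    Assumes T ~ Exponential(event_rate) and C ~ Uniform(0, censor_high);
    we observe X = min(T, C) and the status delta = I(X == T).
    """
    t = rng.exponential(1.0 / event_rate, size=n)   # event times T_ji ~ F_j
    c = rng.uniform(0.0, censor_high, size=n)       # censoring times C_ji ~ G_j
    x = np.minimum(t, c)                            # observed times X_ji
    delta = (t <= c).astype(int)                    # censoring status delta_ji
    return x, delta

x, delta = simulate_group(100, event_rate=1.2, censor_high=5.0)
```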

2.2 Multiple null hypotheses

The cumulative hazard rate function of treatment arm j, denoted \(A_j(t)\), summarizes the total accumulated risk of experiencing the event by time t. No difference (i.e., no effect) between treatment arms \(j_1\) and \(j_2\) with \(j_1 \ne j_2\) corresponds to \(A_{j_1}(t)\equiv A_{j_2}(t)\) for all t, or, equivalently, \(A_{j_1} - A_{j_2}\equiv 0\). In the several-sample problem, let \( \varvec{H} \in \mathbb {R}^{q\times k}\) be a contrast matrix satisfying \(\varvec{H}\varvec{1_k} = \varvec{0}_q\), with \(\varvec{1_k}\) and \(\varvec{0}_q\) denoting vectors of ones and zeros, respectively. We denote the rows of \(\varvec{H}\) as \({\varvec{h}}_{j_1, j_2}\). For ease of presentation, we describe the pairwise comparisons only. Here, the most prominent matrices are those of Dunnett- and Tukey-type, whose rows each contain a single \(-1\) and a single 1, indicating the two-sample comparison of interest. We define the corresponding index sets \(I_{\text {Dunnett}} = \{(1,2), \ldots , (1,k) \}\) and \(I_{\text {Tukey}} = \{(1,2), \ldots , (1,k), (2,3), \ldots , (2,k), \ldots , (k-1,k) \}\). In the following, we will indicate the position in the matrix or vector by the corresponding indices \(j_1\) and \(j_2\), for example, \((-1, 1, 0, \ldots , 0) = \varvec{h}_{1,2}\) for \(j_1=1\) and \(j_2=2\).
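As a minimal sketch of these two contrast-matrix families (in Python rather than the R used throughout the paper; function names are ours), the Dunnett- and Tukey-type matrices for arbitrary k can be built as follows:

```python
import itertools
import numpy as np

def dunnett_contrasts(k):
    """Rows h_{1,j2}: compare group 1 with each of the other k-1 groups."""
    rows = []
    for j2 in range(1, k):
        h = np.zeros(k)
        h[0], h[j2] = -1.0, 1.0
        rows.append(h)
    return np.array(rows)

def tukey_contrasts(k):
    """Rows h_{j1,j2} for all k(k-1)/2 pairs with j1 < j2."""
    rows = []
    for j1, j2 in itertools.combinations(range(k), 2):
        h = np.zeros(k)
        h[j1], h[j2] = -1.0, 1.0
        rows.append(h)
    return np.array(rows)

H = tukey_contrasts(4)                 # q = 6 rows for k = 4
assert np.allclose(H @ np.ones(4), 0)  # contrast property: H 1_k = 0_q
```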
The hypotheses we seek to infer are expressed in relation to the cumulative hazard rate functions as follows:
$$\begin{aligned} {\mathcal {H}}_0&: \varvec{H} \varvec{A} = \varvec{0}_q, \quad \varvec{A}=(A_{1},\ldots ,A_k)^{\top },\\ H_0^{j_1,j_2}&: \{\varvec{h}_{j_1,j_2} \varvec{A} = 0\}, ~(j_1,j_2) \in I, \end{aligned}$$
with \(\varvec{A}^{\top }\) denoting the transposed vector of \(\varvec{A}\) and I being either \(I_{\text {Dunnett}}\) or \(I_{\text {Tukey}}\). In general, the contrast matrix selection depends on the specific question of interest underlying the analysis.

3 Statistical tests

3.1 Adjusted log-rank

As a reference method, we consider the Bonferroni-adjusted log-rank test. To this end, we define the Bonferroni-adjusted significance level \(\alpha _{\text {Bonferroni}} = \alpha / q\), where \(\alpha \) is the original significance level and q is the number of comparisons. The Bonferroni adjustment for multiple comparisons in a survival setting is standard in clinical practice, as discussed in Logan et al. (2005), who provide a comprehensive description and suggest various methods for adjusting for the number of comparisons.
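The Bonferroni procedure described here reduces to a few lines. A hedged Python sketch (the function name is ours) returning the local decisions at level \(\alpha/q\) together with the global minimum-p decision:

```python
def bonferroni_decisions(p_values, alpha=0.05):
    """Compare each local p-value to alpha/q; reject the global null
    if the smallest p-value falls below the adjusted level."""
    q = len(p_values)
    alpha_adj = alpha / q
    local = [p <= alpha_adj for p in p_values]
    return local, min(p_values) <= alpha_adj

# Three comparisons at global alpha = 0.05, so alpha_adj = 0.05 / 3
local, global_reject = bonferroni_decisions([0.004, 0.20, 0.03])
```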
We define the weighted log-rank test as a generalization of the classical log-rank test. Therefore, we employ the conventional counting process notation. Let \( N_j(t)=\sum _{i=1}^{n_j} N_{ji}(t)\) represent the cumulative number of observed events within group j up to time t with \(N_{ji}(t)=I\{X_{ji}\le t, \delta _{ji}=1\} \). Furthermore, we introduce \(Y_j(t)=\sum _{i=1}^{n_j}I\{X_{ji}\ge t\}\), which denotes the number of individuals at risk just before time t in group j. These counting processes enable us to define the Nelson–Aalen estimator for \(A_j\) as \({\widehat{A}}_{j}(t)=\int _{0}^t \frac{I\{Y_j(s)>0\}}{Y_j(s)}\textrm{ d }N_j(s)\) for \(j=1,\ldots ,k\) and \(t\ge 0\). Finally, consider \(n = n_{j_1} + n_{j_2}\) to be the pooled sample size over the two groups of interest.
Then, the weighted log-rank statistic for testing the local null hypothesis \(H^{j_1,j_2}_0: \{{\varvec{h}}_{j_1,j_2} {\varvec{A}} = {\varvec{0}} \} = \{A_{j_1} = A_{j_2} \} \) can be defined, following Andersen et al. (1993), as:
$$\begin{aligned} T_{j_1,j_2}(w)&= T(w, {\varvec{h}}_{j_1, j_2}) \\&= \Big (\frac{n}{n_{j_1} n_{j_2}}\Big )^{1/2} \int \limits _0^\infty w\{ {\widehat{F}}_{j_1, j_2}(t-)\}\frac{Y_{j_1}(t)Y_{j_2}(t)}{Y_{j_1}(t)+Y_{j_2}(t)} \textrm{ d }({\varvec{h}}_{j_1,j_2} \widehat{\varvec{A}}(t))\\&= \Big (\frac{n}{n_{j_1} n_{j_2}}\Big )^{1/2} \int \limits _0^\infty w\{ {\widehat{F}}_{j_1, j_2}(t-)\}\frac{Y_{j_1}(t)Y_{j_2}(t)}{Y_{j_1}(t)+Y_{j_2}(t)} \Big \{\textrm{ d }{\widehat{A}}_{j_2}(t) - \textrm{ d }{\widehat{A}}_{j_1}(t)\Big \}. \end{aligned}$$
Here, \({\widehat{F}}_{j_1, j_2}(t-)\) represents the left-continuous version of the estimator \({\widehat{F}}_{j_1, j_2} = 1 - {\widehat{S}}_{j_1, j_2}\), where \({\widehat{S}}_{j_1, j_2}\) is the Kaplan–Meier estimator based on the pooled sample, w is a continuous weight function, and \(\widehat{\varvec{A}} = (\widehat{A}_1, \ldots , \widehat{A}_k)^{\top } \). Fleming and Harrington (1991) examined a specific subclass of weights given by \(w(t)=t^r(1-t)^g\) \((r,g\in {\mathbb {N}}_0)\); for instance, \(r=g=0\) yields the log-rank test. We derive the individual p-values for the tests from the \(\chi ^2\) distribution and compare them to \(\alpha _{\text {Bonferroni}}\). To make a global statement, we compare the minimal p-value among all local tests to the adjusted significance level.
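To make the integral concrete, here is an illustrative Python translation of \(T_{j_1,j_2}(w)\) with Fleming–Harrington weights, evaluated at the pooled event times (the paper itself uses R's survival package). It returns the statistic only; the studentization needed for a p-value is deliberately omitted, and ties are handled only by simple counting.

```python
import numpy as np

def weighted_logrank_stat(x1, d1, x2, d2, r=0, g=0):
    """Two-sample Fleming-Harrington weighted log-rank statistic
    T = sqrt(n/(n1*n2)) * sum_t w(F(t-)) * Y1*Y2/(Y1+Y2) * {dA2 - dA1},
    with w(u) = u^r (1-u)^g and dA_j(t) = (events in group j at t)/Y_j(t).
    Illustrative sketch: unstandardized, so not directly a test statistic."""
    x = np.concatenate([x1, x2])
    d = np.concatenate([d1, d2])
    n1, n2 = len(x1), len(x2)
    n = n1 + n2
    times = np.unique(x[d == 1])           # pooled event times
    stat, surv = 0.0, 1.0                  # surv tracks the pooled KM S(t-)
    for t in times:
        y1 = np.sum(x1 >= t)               # at risk in group 1
        y2 = np.sum(x2 >= t)               # at risk in group 2
        y = y1 + y2
        e1 = np.sum((x1 == t) & (d1 == 1)) # events in group 1 at t
        e2 = np.sum((x2 == t) & (d2 == 1)) # events in group 2 at t
        w = (1 - surv) ** r * surv ** g    # w(F(t-)) with F = 1 - S
        if y1 > 0 and y2 > 0:
            stat += w * (y1 * y2 / y) * (e2 / y2 - e1 / y1)
        surv *= 1 - (e1 + e2) / y          # update pooled Kaplan-Meier
    return np.sqrt(n / (n1 * n2)) * stat
```

With \(r=g=0\) this reduces to the (unstandardized) classical log-rank statistic; if group 2 systematically survives longer than group 1, the early increments \(\textrm{d}\widehat{A}_2 - \textrm{d}\widehat{A}_1\) are negative and so is the statistic.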
For practical implementation, we utilize the R package survival (Therneau 2023) and its function survdiff.

3.2 Adjusted mdir

In the context of our specific objectives, we are interested in testing procedures that are more robust against multiple alternatives. For two-group comparisons, the multi-directional log-rank test has been proposed as a combination procedure of different weighted log-rank tests (Brendel et al. 2014; Ditzhaus and Friedrich 2020). The test assumes the equality of survival under the null hypothesis, with the choice of weights determining the alternative hypothesis. We are particularly interested in weights that intersect the x-axis, such as \(w(t) = 1 - 2t\), as they are specifically designed to address crossing hazard alternatives.
By default, the R package mdir.logrank (Ditzhaus and Friedrich 2018) implements a combination of the log-rank weight \(w^{(1)}\equiv 1\) and this crossing weight. Dormuth et al. (2023) showed that this default set of weights seems to be robust against multiple alternatives. Nevertheless, if desired, additional weights can be combined to cover more alternative hypotheses. For the general case of m linearly independent weights \(w^{(1)},..., w^{(m)}\), the local test statistic takes a studentized quadratic form:
$$ Z_{j_1, j_2}=(T_{j_1,j_2}(w^{(1)}),..., T_{j_1,j_2}(w^{(m)}))\,\hat{\Sigma }^-_{j_1,j_2}\, (T_{j_1,j_2}(w^{(1)}),..., T_{j_1,j_2}(w^{(m)}))^{\top }. $$
The entries of \(\hat{\Sigma }_{j_1, j_2} \in {\mathbb {R}}^{m \times m}\) are given by
$$\begin{aligned} & (\hat{\Sigma }_{j_1,j_2})_{p,s} \\ & \quad = \frac{n}{n_{j_1}n_{j_2}} \int \limits _{[0,\infty )}w^{(s)}\{{\hat{F}}_{j_1,j_2}(t-)\}w^{(p)}\{{\hat{F}}_{j_1,j_2}(t-)\} \\ & \qquad \frac{Y_{j_1}(t)Y_{j_2}(t)}{Y_{j_1}(t)+Y_{j_2}(t)}d{\hat{A}}_{j_1,j_2}(t), ~~~ (p,s \in \{1,...,m\}), \end{aligned}$$
with \({\hat{A}}_{j_1,j_2}\) the pooled Nelson–Aalen estimator of groups \(j_1\) and \(j_2\). \(\hat{\Sigma }^-_{j_1, j_2}\) represents the Moore–Penrose inverse of the empirical covariance matrix of the weighted log-rank statistics. For linearly independent weights \(w^{(1)},\ldots ,w^{(m)}\) fulfilling the assumptions of Ditzhaus et al. (2023) (continuous and of bounded variation), the test statistic \(Z_{j_1, j_2}\) is asymptotically \(\chi _m^2\)-distributed under the null hypothesis. Ditzhaus and Friedrich (2020) also proposed a permutation-based approach. Since the asymptotic test can exhibit inflated type I errors, we only consider the permutation-based approach.
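The studentized quadratic form itself is a one-liner once the statistics and their covariance estimate are available. A Python illustration using numpy's Moore–Penrose inverse (the paper's actual implementation lives in the R package mdir.logrank; the function name and example numbers here are ours):

```python
import numpy as np

def mdir_statistic(t_vec, sigma_hat):
    """Studentized quadratic form Z = T Sigma^- T' built from m weighted
    log-rank statistics; under H0 it is asymptotically chi^2-distributed
    with rank(Sigma) degrees of freedom."""
    t_vec = np.asarray(t_vec, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    sigma_minus = np.linalg.pinv(sigma_hat)        # Moore-Penrose inverse
    z = float(t_vec @ sigma_minus @ t_vec)         # quadratic form
    df = int(np.linalg.matrix_rank(sigma_hat))     # degrees of freedom
    return z, df  # compare z against the chi^2_df distribution

# Two weights (m = 2) with hypothetical statistics and covariance:
z, df = mdir_statistic([2.1, -0.4], [[1.0, 0.3], [0.3, 1.0]])
```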
Again, we employ the Bonferroni adjusted significance level \(\alpha _{\text {Bonferroni}}\) to compare to the obtained local p-values. Analogously to the adjusted log-rank test procedure, we obtain the global test decision by comparing the smallest p-value to the adjusted significance level.

3.3 MultiWeightedLR

Knowing that maximum tests are a common approach for multiple testing problems (Konietschke et al. 2013), a straightforward extension of the weighted log-rank test is to use the maximum over them and exploit the covariance structure between the different tests. We use the same weights as for the adjusted mdir approach without combining them in a quadratic form. Instead, we consider each weighted test individually. After calculating the corresponding covariance matrix, we take the maximum of all weighted test statistics as our global maximum test statistic. Mathematically, we write:
$$\begin{aligned} T_{\max } =\max _{r \in \{1, \ldots , m\}, (j_1,j_2) \in I} (|T_{j_1,j_2}(w^{(r)})|). \end{aligned}$$
For the local testing problem we focus on \( T^{j_1,j_2}_{\max } =\max _{r \in \{1, \ldots , m\}} |T_{j_1,j_2}(w^{(r)})|\). Similar to the proof of Theorem 2 in Ditzhaus et al. (2023), it can be shown that the vector \((T_{j_1,j_2}(w^{(r)}))_{r,j_1,j_2}\) is, under regularity conditions, asymptotically centered multivariate normally distributed with covariance matrix \(\Sigma \in {\mathbb {R}}^{m \cdot q \times m \cdot q}\). We thus take the equicoordinate \((1-\alpha )\)-quantile (Konietschke et al. 2012) of this distribution as a critical value for the statistic \(T_{\max }\), which yields the MultiWeightedLR test.
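The equicoordinate quantile generally has no closed form. As a rough Monte Carlo stand-in for exact multivariate-normal routines (such as those in R's mvtnorm, which a production implementation would use), one can approximate it in Python as follows; the function name and simulation size are our assumptions:

```python
import numpy as np

def equicoordinate_quantile(corr, alpha=0.05, n_sim=200_000, seed=7):
    """Monte Carlo approximation of the equicoordinate (1-alpha)-quantile
    of max_i |Z_i| for Z ~ N(0, corr): the smallest c with
    P(|Z_1| <= c, ..., |Z_d| <= c) approximately 1 - alpha."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(corr)), corr, size=n_sim)
    return np.quantile(np.abs(z).max(axis=1), 1 - alpha)

# For two independent statistics the joint quantile exceeds the
# marginal two-sided 5% quantile of roughly 1.96:
c = equicoordinate_quantile(np.eye(2))
```

Because the true correlation between the test statistics is exploited rather than bounded away (as Bonferroni does), the resulting critical value is never larger than the Bonferroni one.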

3.4 multiCASANOVA

Ditzhaus et al. (2023) proposed the CASANOVA (Cumulative Aalen Survival Analysis-of-Variance) approach for general factorial designs with right-censored time-to-event data. The core idea of the method is an extension of weighted log-rank tests to the factorial design setup. Therefore, they expanded the combination approach of weighted log-rank tests (mdir) for the two-sample scenario to general factorial survival designs, implemented in the R package GFDsurv (Ditzhaus et al. 2022). For further information, we refer to Ditzhaus et al. (2023).
We aim to extend the CASANOVA approach to allow the estimation and testing of user-specific contrasts in a multiple testing framework. Compared to the aforementioned approaches, the main difference is that we consider pooled quantities over all groups, not only the two groups of interest. To this end, we define a local test statistic for contrast \(\varvec{h}_{j_1, j_2}\) as
$$\begin{aligned} \tilde{T}_{j_1,j_2}(w)&= \tilde{T}(w, \varvec{h}_{j_1, j_2}) \\&= \Big (\frac{n}{n_{j_1} n_{j_2}}\Big )^{1/2} \int \limits _0^\infty w\{ {\widehat{F}}(t-)\}\frac{Y_{j_1}(t)Y_{j_2}(t)}{Y(t)} \textrm{ d }(\varvec{h}_{j_1,j_2} \widehat{\varvec{A}}(t))\\&= \Big (\frac{n}{n_{j_1} n_{j_2}}\Big )^{1/2} \int \limits _0^\infty w\{ {\widehat{F}}(t-)\}\frac{Y_{j_1}(t)Y_{j_2}(t)}{Y(t)} \Big \{\textrm{ d }{\widehat{A}}_{j_2}(t) - \textrm{ d }{\widehat{A}}_{j_1}(t)\Big \}, \end{aligned}$$
where \({\widehat{F}}(t-)\) represents the left-continuous version of the pooled estimator \({\widehat{F}}\) and Y(t) is the total number of individuals at risk over all groups. As in the adjusted mdir test, we combine several weights (still for one single contrast) by considering the corresponding quadratic form given by
$$\begin{aligned} C_{j_1,j_2} = (\tilde{T}_{j_1,j_2}(w^{(1)}),..., \tilde{T}_{j_1,j_2}(w^{(m)}))\widehat{\text {Cov}}_{j_1,j_2}^-\, (\tilde{T}_{j_1,j_2}(w^{(1)}),..., \tilde{T}_{j_1,j_2}(w^{(m)}))^{\top }, \end{aligned}$$
where the inner matrix is defined by
$$\begin{aligned} (\widehat{\text {Cov}}_{j_1,j_2})_{p,s}= \frac{n}{n_{j_1}n_{j_2}} \int \limits _{[0,\infty )}w^{(s)}\{{\hat{F}}(t-)\}w^{(p)}\{{\hat{F}}(t-)\}\frac{Y_{j_1}(t)Y_{j_2}(t)}{Y(t)}d{\hat{A}}(t), ~~~ (p,s=1,...,m) \end{aligned}$$
and \(\widehat{\text {Cov}}_{j_1,j_2}^-\) represents its Moore–Penrose inverse. Similar to the maximum approach within MultiWeightedLR, we now consider the maximum of these Wald-type statistics over all contrasts of interest as the global test statistic
$$\begin{aligned} C_{\max } = \max _{(j_1,j_2) \in I} (C_{j_1,j_2}). \end{aligned}$$
Note that we did not take the maximum over the different weights, as those are already incorporated within the quadratic forms.
We use the common wild bootstrap approach for counting processes in time-to-event analyses (Bluhmki et al. 2019, 2018) to approximate the limiting distribution. Therefore, we consider independent and identically distributed variables \(G_{j1}, \ldots ,G_{jn_j}\) with \(E(G_{ji})=0\) and \(\text {Var}(G_{ji})=1\). The wild bootstrap version of the normalized Nelson–Aalen estimator \({\widehat{A}}_{j} - A_{j}\) as defined in Bluhmki et al. (2019) is then given by:
$$ {\widehat{A}}^*_{j}(t)=\int \limits _{0}^t \frac{I\{Y_j(s)>0\}}{Y_j(s)}\textrm{ d }\left( \sum _{i=1}^{n_j} G_{ji} N_{ji}(s)\right) . $$
The motivation behind \({\widehat{A}}^*_{j}\) stems from the martingale representation of the normalized Nelson–Aalen estimator:
$$\sqrt{n_j}({\widehat{A}}_{j}(t) - A_{j}(t)) \doteq \sqrt{n_j} \int \limits _{0}^t \frac{I\{Y_j(s)>0\}}{Y_j(s)}\textrm{ d }\left( \sum _{i=1}^{n_j} M_{ji}(s)\right) , $$
where \(\doteq \) indicates that the difference between both sides converges to 0 in probability, and \(M_{ji}\) denotes the martingale obtained from the Doob–Meyer decomposition of \(N_{ji}\) with \(M_{ji}(s)=N_{ji}(s)-\int _{0}^s I \{X_{ji}\ge u\}dA_j(u)\). The wild bootstrap Nelson–Aalen version \({\widehat{A}}^*_{j}(t)\) is thus obtained by replacing the unobservable martingales \(M_{ji}\) with the observable \(G_{ji} N_{ji}\). As shown in Bluhmki et al. (2019), the distribution of \(\sqrt{n_j}({\widehat{A}}_{j} - A_{j})\) and the conditional distribution of \(\sqrt{n_j}{\widehat{A}}^*_{j}\) (given the data) coincide asymptotically. Assuming \(\lim n_j/n \in (0,1)\), this implies that \(\sqrt{n} ({\widehat{A}}^{*}_{j_2}(t) -{\widehat{A}}^{*}_{j_1}(t))\) can be used to approximate the \(H_0^{j_1,j_2}\)-null distribution of \(\sqrt{n}({\widehat{A}}_{j_2}(t)-{\widehat{A}}_{j_1}(t))\). We thus define the wild bootstrap versions \(\tilde{T}^*_{j_1,j_2}\) and \(C^*_{j_1,j_2}\) of \(\tilde{T}_{j_1,j_2}\) and \(C_{j_1,j_2}\), respectively, by replacing \({\widehat{A}}_{j_2}(t)-{\widehat{A}}_{j_1}(t)\) with \({\widehat{A}}^{*}_{j_2}(t) -{\widehat{A}}^{*}_{j_1}(t)\).
Since counting processes are discrete, we opt for discrete distributions for the \(G_{ji}\). We focus on two common choices: (i) the Rademacher distribution (Liu 1988), and (ii) the centered Poisson distribution (Mammen 2012). Each choice results in its own wild bootstrap critical value \(q_{\alpha }^*\), the upper \(\alpha \)-quantile of the conditional distribution of \(C^*_{\max }\) given our data \((X_{ji}, \delta _{ji})\). We then obtain the global test decision by evaluating \(C_{\max } > q^*_{\alpha }\) and the local test decisions by \(C_{j_1,j_2} > q^*_{\alpha }\).
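For intuition, here is a stripped-down Python sketch of the wild bootstrap Nelson–Aalen process for a single group, under the simplifying assumptions of no censoring and no ties (a real analysis would use the resampling machinery of the R package GFDsurv; the function name is ours):

```python
import numpy as np

def wild_bootstrap_na(event_times, grid, multipliers="rademacher", rng=None):
    """Wild bootstrap Nelson-Aalen process A*_j(t) = sum_{s <= t} G_i / Y(s):
    the martingale increments are replaced by multiplier-weighted jumps G_i N_i.
    Simplifying assumptions: every subject has an observed event, no ties."""
    rng = rng or np.random.default_rng(0)
    n = len(event_times)
    if multipliers == "rademacher":
        g = rng.choice([-1.0, 1.0], size=n)     # P(G = +1) = P(G = -1) = 1/2
    else:
        g = rng.poisson(1.0, size=n) - 1.0      # centered Poisson(1)
    order = np.argsort(event_times)
    x = np.asarray(event_times, dtype=float)[order]
    g = g[order]
    y = n - np.arange(n)                        # at-risk count just before each event
    inc = g / y                                 # multiplier-weighted jumps G_i / Y(X_i)
    return np.array([inc[x <= t].sum() for t in grid])

path = wild_bootstrap_na(np.arange(1.0, 11.0), grid=[2.5, 5.0, 10.0])
```

Because the multipliers are centered with unit variance, the bootstrap process fluctuates around zero with the same (estimated) variance structure as the normalized Nelson–Aalen estimator.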

4 Simulation study

We conducted an extensive simulation study in R 4.4.0 (R Core Team 2021) to evaluate the rejection rate and the power performance of the candidate methods.

4.1 Simulation setup

We simulated data for \(k=4\) groups considering the Tukey- and Dunnett-type contrast matrices. We considered four scenarios, each with different distribution functions. Each represents a specific case of hazard relationships, such as (i) proportional hazards, (ii) non-proportional and non-crossing hazards, (iii) crossing hazards, and (iv) a mixed scenario. The specific survival functions are presented in Table 1. We set the group size for each scenario to 100; the censoring rates vary between \(0\%\) and \(30\%\) with uniform censoring. The work of Dormuth et al. (2023) indicated that the choice of censoring distribution does not have a major impact on the performance of statistical tests. Considering all possible combinations of censoring, survival distributions, and contrast matrices, we end up with a total of 4 scenarios \(\times \) 1120 parameter combinations = 4480 different settings.
The number of parameter combinations is reduced because we only considered which survival time distributions are combined for the individual groups, not the order in which they are assigned: for example, \(S_1, S_1, S_1, S_2\) is the same combination as \(S_1, S_1, S_2, S_1\). For the Tukey-type contrast matrix, we considered every possible pairwise comparison, which for \(k=4\) results in six tests. For the Dunnett-type contrast matrix, we compared the first group to all other groups, resulting in three contrasts.
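The "order does not matter" convention corresponds to counting multisets. As an illustrative Python check (note: mapping this count to the reported 1120 combinations per scenario additionally involves the censoring and contrast settings, which we do not reconstruct here):

```python
from itertools import combinations_with_replacement

# Unordered assignments of a scenario's four distributions to k = 4 groups:
combos = list(combinations_with_replacement(["S1", "S2", "S3", "S4"], 4))

# Each multiset appears exactly once, e.g. (S1, S1, S1, S2) is listed
# but its reordering (S1, S1, S2, S1) is not counted again.
n_combos = len(combos)  # multiset coefficient C(4 + 4 - 1, 4) = 35
```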
10,000 simulation runs with 1000 resampling iterations were performed for each setting. The global level of significance was set to 0.05 throughout.

4.2 Simulation results under the null hypothesis

Figure 1 illustrates the familywise error rate (FWER) for all survival scenarios for the different contrast matrix types. We set the \(\alpha \)-level to \(5\%\). The dashed lines represent the corresponding binomial precision interval, based on the 10,000 simulation runs.
Fig. 1
FWER under \({\mathcal {H}}_0\) for all settings for the Dunnett-type (left) and Tukey-type (right) contrast matrices. The dashed lines represent the borders of the binomial precision interval \([4.57\%, 5.43\%]\)
For both contrast matrices, almost all methods control the FWER well. The adjusted mdir is the only test that is a little liberal when comparing the median to the global \(\alpha \)-level of \(5\%\) for the Dunnett-type matrix. The new multiple-testing approaches are more conservative than the adjusted approaches, especially for the Tukey-type contrast matrices, with the multiWeightedLR being the most conservative.
Table 1
Simulation scenarios (CDFs per scenario; see the linked figures for the survival and hazard curves)

Prop (proportional hazards):
\(F_1(t) = \textit{Exponential}(1.2)\), \(F_2(t) = \textit{Exponential}(1.8)\), \(F_3(t) = \textit{Exponential}(2.3)\), \(F_4(t) = \textit{Exponential}(2.9)\)
Visualization: https://static-content.springer.com/image/art%3A10.1007%2Fs10985-025-09676-9/MediaObjects/10985_2025_9676_Figa_HTML.gif

NProp (non-proportional, non-crossing hazards):
\(F_1(t) = \textit{Lognormal}(2.2, 1.7)\), \(F_2(t) = \textit{Lognormal}(2.6, 1.6)\), \(F_3(t) = \textit{Lognormal}(3.5, 1.7)\), \(F_4(t) = \textit{Lognormal}(4.5, 1.6)\)
Visualization: https://static-content.springer.com/image/art%3A10.1007%2Fs10985-025-09676-9/MediaObjects/10985_2025_9676_Figb_HTML.gif

Cross (crossing hazards):
\(F_1(t) = \textit{Weibull}(1.5, 5)\), \(F_2(t) = \textit{Weibull}(2.5, 5)\), \(F_3(t) = \textit{Weibull}(3.5, 5)\), \(F_4(t) = \textit{Weibull}(4.5, 5)\)
Visualization: https://static-content.springer.com/image/art%3A10.1007%2Fs10985-025-09676-9/MediaObjects/10985_2025_9676_Figc_HTML.gif

Mix (mixed setting):
\(F_1(t) = \textit{Lognormal}(2.3, 1.7)\), \(F_2(t) = \textit{Exponential}(0.05)\), \(F_3(t) = \textit{Weibull}(2.4, 11.7)\), \(F_4(t) = \textit{Lognormal}(3, 1.6)\)
Visualization: https://static-content.springer.com/image/art%3A10.1007%2Fs10985-025-09676-9/MediaObjects/10985_2025_9676_Figd_HTML.gif

4.3 Simulation results under the alternative hypothesis

We focused on the local decisions under the alternative hypothesis to assess the power. Figures 2 and 3 illustrate the rejection rates when different survival distributions are present. Each figure consists of four subfigures, one for each scenario. It should be noted that a higher number of tests decreases the power of each local hypothesis. This property is visible in the plots, showing generally higher power for the Dunnett plots than the Tukey plots. Besides that, the tests behave similarly for both contrast matrix types. The adjusted log-rank test is the most powerful in the setting with proportional hazards, while the other tests perform equally well. Under non-proportional but non-crossing hazards, all tests have high power, with the new approaches yielding a slightly lower variability. In the crossing scenario, the log-rank test loses power drastically due to the violation of the PH assumption. The four approaches designed for non-proportional hazards have high power, with the multiWeightedLR being slightly less powerful than the other three tests. In the mixed setting, the adjusted mdir performs best in terms of power, followed by the three methods introduced in this paper. The log-rank test has the highest variability and the lowest median power.
The rejection rates for the local tests with no difference in survival are depicted in Figures S1 and S2 in the Supplemental Material. Overall, the rejection rates among the approaches are similar, with lower rejection rates for the Tukey-type contrast matrices. Additionally, the power for each local test is provided in the tables in the Supplemental Material.
The adjusted mdir test performs best for two of the four settings considered. Considering that it showed slightly liberal behavior under the null hypothesis, these results should be interpreted carefully. The methods introduced in this publication yield robust results regarding power among the different scenarios. The adjusted log-rank test loses power dramatically in the scenario with crossing hazards.
Fig. 2
Local power over all tests under the alternative for Dunnett-type contrasts for all four scenarios (each boxplot contains 1136 data points)
Fig. 3
Local power over all tests under the alternative for Tukey-type contrasts for all four scenarios (each boxplot contains 2016 data points)
In the Supplement (Figures S3–S5), we present an additional analysis of the behavior of the different tests regarding the FWER and power for smaller sample sizes (\(n=50\)). The results indicate that the multiWeightedLR approach exhibits an inflated FWER, likely due to the normal approximation. In contrast, both multiCASANOVA bootstrap approaches maintain strong control over the family-wise error rate and consistently deliver good results in terms of power. The adjusted LR and mdir test still control the FWER but show increased variability in terms of power.

5 Illustrative data example

To illustrate the novel approaches on real-world data, we used publicly available data from the CoMMpass study (dbGaP accession: phs000748.v4.p3). This study is designed to associate clinical outcome with genetic profiles and contains longitudinal clinical and molecular data from multiple myeloma (MM) patients. Based on the transcriptional profile and the expression level of biologically relevant core machinery that plays a vital role in the stress response (Heynen et al. 2023), we clustered MM patients into seven groups. Figure 4 shows the Kaplan–Meier curves of the seven different groups. We assume that we are interested in comparing every group with one another and consider a Tukey design. The significance level was set to \(\alpha = 0.05\) with a total of 21 tests. The corrected significance level is thus \(\alpha _{\text {Bonferroni}} = 0.0024\).
Fig. 4
Kaplan–Meier plot of the seven treatment groups of patients with multiple myeloma (MM)
By examining the survival curves, we anticipate that the methods will identify a significant overall difference among the groups. Specifically, we expect group 2 to differ from the other groups. However, we do not expect to see any differences among groups 3, 4, and 5. We applied all testing procedures described in this paper to investigate these premises. For all approaches combining multiple weighted log-rank tests, we included the weights \(w^{(1)}(t) = 1\) and \(w^{(2)}(t) = 1 - 2t\). We set the number of resampling iterations to 1000 for all resampling-based approaches.
The detailed results are listed in Supplemental Material Table S1. The adjusted log-rank and mdir tests detected six significant differences between groups, while the three new methods detected five. The detected differences are consistent across the methods. All tests found the pairwise differences between group two and groups three, four, five, and seven, as well as between groups one and four, to be significant. A significant difference between groups two and six was only found by the adjusted LR test and the adjusted mdir test.
All tests also rejected the global null hypothesis of no difference between the groups. In summary, we could show that in this real-world application, the results of the new methods are consistent with those of the adjusted log-rank test.

6 Discussion

We explored various statistical methods for addressing multiple contrast problems with time-to-event endpoints, including traditional and newly developed approaches. To assess the approaches’ performance, we compared the Family-Wise Error Rate (FWER) control and the power performance of these methods under different survival scenarios. The results of our simulation study and real-world data application provide valuable insights into the strengths and limitations of each approach.
Most methods maintain adequate control of the FWER. The adjusted mdir test exhibited a slightly liberal behavior, particularly for Dunnett-type contrasts. This deviation suggests that while the adjusted mdir test might be powerful, it occasionally exceeds the acceptable error rate, which warrants caution in its interpretation under null conditions. On the other hand, the multiWeightedLR and multiCASANOVA methods were generally more conservative, particularly for Tukey-type contrast matrices. This conservativeness could imply a lower risk of Type I errors but may come at the cost of reduced statistical power.
Under alternative hypotheses, the power analysis revealed notable differences in the tests’ performance depending on the survival scenario. For proportional hazards, the adjusted log-rank test showed the highest power, outperforming the other methods. Under non-proportional but non-crossing hazards, all tests showed high power, with the new approaches exhibiting slightly lower variability; this robustness makes them suitable choices when proportional hazards cannot be guaranteed. The log-rank test’s power decreased drastically in the specific case of crossing hazards, whereas the four approaches specifically designed for non-proportional hazards (adjusted mdir, multiWeightedLR, and the two multiCASANOVA variants) maintained high power, confirming their utility in these settings. Finally, the adjusted mdir test achieved the best power in the mixed scenario. Although slightly less powerful there, the new methods delivered more consistent results across the different scenarios, highlighting their robustness.
The results suggest potential areas for further methodological improvement. While the Bonferroni correction is widely used to control the type I error rate, its conservative nature may result in lower power, particularly in settings with many comparisons. More sophisticated adjustment techniques, such as the Holm step-down procedure, could better balance error rate control and power, as discussed in previous studies.
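The Bonferroni–Holm comparison can be sketched in a few lines. This is a generic illustration of the two adjustment rules on a hypothetical vector of p-values, not tied to any of the packages used in the paper: Holm's step-down rule rejects everything Bonferroni rejects, and possibly more, while controlling the FWER at the same level.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i iff p_i <= alpha / m (single-step)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm's step-down procedure: visit p-values in increasing order,
    compare the k-th smallest (k = 0, 1, ...) to alpha / (m - k),
    and stop at the first failure. Uniformly more powerful than
    Bonferroni at the same FWER."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if pvals[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break
    return reject

pvals = [0.004, 0.011, 0.039, 0.012, 0.3]
# Bonferroni (threshold 0.05/5 = 0.01) rejects only the first hypothesis;
# Holm rejects three (0.004, 0.011, and 0.012) at the same FWER level.
```

The example shows the gap in action: with five comparisons, Holm's step-wise thresholds (0.01, 0.0125, 0.0167, …) recover two rejections that the flat Bonferroni threshold discards.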
Additionally, evaluating the performance of these methods in unbalanced designs could provide a more comprehensive understanding of their behavior in practical applications. This is particularly interesting since Munko et al. (2024) showed that such conditions can boost the power of specific local tests.
In the illustrative data example involving patients with multiple myeloma, the new methods produced results consistent with those obtained from the adjusted log-rank tests. Although the novel approaches identified fewer significant differences than the traditional methods, their findings were largely aligned, underscoring their reliability in practical scenarios. Together with their conservative FWER control, this consistency suggests that the new methods offer a robust alternative for analyzing time-to-event data in clinical studies. Future research could aim to exploit the FWER budget more efficiently for the new approaches, e.g., by incorporating closed testing procedures. In general, it is essential to critically assess whether a higher number of statistically significant results truly reflects a superior testing approach, as statistical significance does not inherently equate to clinical relevance.

Acknowledgements

We sincerely thank the reviewers for their insightful feedback and thoughtful recommendations, which have significantly strengthened this work.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Title: Beyond Bonferroni: new multiple contrast tests for time-to-event data under non-proportional hazards
Authors: Ina Dormuth, Carolin Herrmann, Frank Konietschke, Markus Pauly, Matthias Wirth, Marc Ditzhaus
Publication date: 01-03-2026
Publisher: Springer US
Published in: Lifetime Data Analysis, Issue 1/2026 (Print ISSN 1380-7870, Electronic ISSN 1572-9249)
DOI: https://doi.org/10.1007/s10985-025-09676-9

References

Andersen PK, Borgan O, Gill RD, Keiding N (1993) Statistical models based on counting processes. Springer, Berlin
Blanche P, Dartigues JF, Riou J (2022) A closed max-t test for multiple comparisons of areas under the ROC curve. Biometrics 78(1):352–363
Bluhmki T, Dobler D, Beyersmann J, Pauly M (2019) The wild bootstrap for multivariate Nelson–Aalen estimators. Lifetime Data Anal 25:97–127
Bluhmki T, Schmoor C, Dobler D, Pauly M, Finke J, Schumacher M, Beyersmann J (2018) A wild bootstrap approach for the Aalen–Johansen estimator. Biometrics 74(3):977–985
Brendel M, Janssen A, Mayer CD, Pauly M (2014) Weighted logrank permutation tests for randomly right censored life science data. Scand J Stat 41(3):742–761. https://doi.org/10.1111/sjos.12059
Bretz F, Genz A, Hothorn LA (2001) On the numerical availability of multiple comparison procedures. Biometrical J 43(5):645–656
Ditzhaus M, Dobler D, Pauly M, Steinhauer P, Munko M (2022) GFDsurv: tests for survival data in general factorial designs. R package version 0.1.1. https://CRAN.R-project.org/package=GFDsurv
Ditzhaus M, Friedrich S (2018) mdir.logrank: multiple-direction logrank test. R package version 0.0.4. https://CRAN.R-project.org/package=mdir.logrank
Ditzhaus M, Friedrich S (2020) More powerful logrank permutation tests for two-sample survival data. J Stat Comput Simul 90(12):2209–2227
Ditzhaus M, Genuneit J, Janssen A, Pauly M (2023) CASANOVA: permutation inference in factorial survival designs. Biometrics 79(1):203–215
Dormuth I, Liu T, Xu J, Pauly M, Ditzhaus M (2023) A comparative study to alternatives to the log-rank test. Contemp Clin Trials 128:107165
Dormuth I, Liu T, Xu J, Yu M, Pauly M, Ditzhaus M (2022) Which test for crossing survival curves? A user's guideline. BMC Med Res Methodol 22(1):1–7
Fleming TR, Harrington DP (1991) Counting processes and survival analysis, vol. 625. Wiley, London
Gao X, Alvo M (2008) Nonparametric multiple comparison procedures for unbalanced two-way layouts. J Stat Plan Inference 138(12):3674–3686
Gao X, Alvo M, Chen J, Li G (2008) Nonparametric multiple comparison procedures for unbalanced one-way factorial designs. J Stat Plan Inference 138(6):2574–2591
Hasler M, Hothorn LA (2008) Multiple contrast tests in the presence of heteroscedasticity. Biometrical J Math Methods Biosci 50(5):793–800
Heynen GJ, Baumgartner F, Heider M, Patra U, Holz M, Braune J, Kaiser M, Schäffer I, Bamopoulos SA, Ramberger E et al (2023) SUMOylation inhibition overcomes proteasome inhibitor resistance in multiple myeloma. Blood Adv 7(4):469–481
Konietschke F, Bösiger S, Brunner E, Hothorn LA (2013) Are multiple contrast tests superior to the ANOVA? Int J Biostat 9(1). https://doi.org/10.1515/ijb-2012-0020
Konietschke F, Hothorn LA, Brunner E (2012) Rank-based multiple test procedures and simultaneous confidence intervals. Electron J Stat 6. https://doi.org/10.1214/12-EJS691
Kristiansen IS (2012) PRM39 Survival curve convergences and crossing: a threat to validity of meta-analysis? Value Health 15(7):A652
Liu RY (1988) Bootstrap procedures under some non-iid models. Ann Stat 16(4):1696–1708
Logan BR, Wang H, Zhang MJ (2005) Pairwise multiple comparison adjustment in survival analysis. Stat Med 24(16):2509–2523
Mammen E (2012) When does bootstrap work? Asymptotic results and simulations, vol. 77. Springer, Berlin
Munko M, Ditzhaus M, Dobler D, Genuneit J (2024) RMST-based multiple contrast tests in general factorial designs. Stat Med 43(10):1849–1866. https://doi.org/10.1002/sim.10017
R Core Team (2021) R: a language and environment for statistical computing. https://www.R-project.org/
Schaarschmidt F, Biesheuvel E, Hothorn LA (2009) Asymptotic simultaneous confidence intervals for many-to-one comparisons of binary proportions in randomized clinical trials. J Biopharm Stat 19(2):292–310
Therneau TM (2023) A package for survival analysis in R. R package version 3.5-5. https://CRAN.R-project.org/package=survival
Trinquart L, Jacot J, Conner SC, Porcher R (2016) Comparison of treatment effects measured by the hazard ratio and by the ratio of restricted mean survival times in oncology randomized controlled trials. J Clin Oncol 34(15):1813–1819
