Assessing delayed treatment benefits of immunotherapy using long-term average hazard: a novel test/estimation approach

  • Open Access
  • 14-10-2025

Abstract

This article introduces a novel test/estimation approach using long-term average hazard (LT-AH) to assess delayed treatment benefits of immunotherapy. The study addresses the limitations of conventional log-rank/hazard-ratio tests, which assume proportional hazards, often violated in immunotherapy trials. The proposed LT-AH method focuses on the intensity of event occurrence in a later study time window where the survival benefit of immunotherapy appears. Numerical studies demonstrate that LT-AH-based tests are more powerful than standard average hazard (AH)-based tests in detecting delayed treatment effects. The article also provides a real-world example using data from the CheckMate 214 study, comparing nivolumab plus ipilimumab with sunitinib in patients with advanced renal cell carcinoma. The results show that the LT-AH approach can capture long-term benefits more effectively than conventional methods. The study concludes that LT-AH is a robust framework for estimating the magnitude of between-group differences, particularly in trials where delayed onset of treatment benefits is anticipated.

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1007/s10985-025-09671-0.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

In recent years, there have been many immunotherapy clinical trials in cancer clinical research. According to ClinicalTrials.gov, as of October 1, 2023, 86 phase III clinical trials were actively recruiting patients to evaluate checkpoint inhibitor immunotherapies. Generally, in late-stage cancer clinical trials, the primary outcomes for evaluating the treatment effect of investigational therapies are time-to-event outcomes such as overall survival and progression-free survival (PFS), and the conventional analytical approach uses the log-rank test for statistical comparison and Cox’s hazard ratio to estimate the magnitude of the treatment effect. On the other hand, various authors have argued that this conventional log-rank/hazard-ratio test/estimation approach is not optimal for immunotherapy trials because many empirical studies have shown that the proportional hazards assumption does not hold in these trials (Chen 2013; Xu et al. 2017; Alexander et al. 2018; Huang 2018). The pattern of difference that is often observed in immunotherapy trials is the so-called ‘delayed difference.’ Fig. 1A illustrates an example of the delayed difference pattern. It shows the Kaplan-Meier curves of PFS data from a randomized controlled trial that compared nivolumab plus ipilimumab to sunitinib in patients with advanced renal cell carcinoma (Motzer et al. 2018). In this trial, the two PFS curves were almost identical from time 0 to 7 months, but a benefit of immunotherapy appeared after 7 months. Of course, if the sample size of the study is not large, two estimated survival curves can show a delayed difference pattern even when the sample comes from a proportional hazards data-generating process. However, given the empirical evidence from previous immunotherapy trials, it would not be best practice to use the conventional log-rank/hazard-ratio test/estimation approach, which relies on the proportional hazards assumption.
Fig. 1
Estimated Kaplan–Meier curves of progression-free survival time (A) and the area under the curves on the time range [7, 21] (months) (B) with the data reconstructed from the publication of the CheckMate 214 study
Fig. 2
Survival functions of event time distributions of the treatment group (solid line) and control group (dashed line) used in the numerical studies. A no difference; B proportional hazards difference; C delayed difference I; D delayed difference II; E delayed difference III
Fig. 3
Survival functions of censoring time distributions used in the numerical studies. A no censoring; B light censoring; C moderate censoring
Here, we briefly describe some more general considerations with respect to the conventional log-rank/hazard-ratio test/estimation approach. First, this conventional approach exhibits test/estimation coherency (Uno et al. 2020). The statistical significance derived from the test result will not contradict whether the confidence interval for the hazard ratio covers the null value or not. Since the qualitative conclusion derived from the test and the quantitative summary derived from the interval estimation are coherent, this conventional approach will not generate confusing results for decision-making.
Regarding the testing part of the conventional test/estimation approach, the log-rank test and the partial likelihood-based tests (such as the score test and the Wald test) for the hazard ratio are asymptotically valid tests regardless of the adequacy of the proportional hazards assumption. Also, if the proportional hazards assumption holds, these tests are asymptotically the most powerful tests. On the other hand, when the proportional hazards assumption does not hold, they are not optimal (Fleming and Harrington 1991). For the delayed difference pattern, for example, other tests will offer higher power than these.
Regarding the estimation part of the conventional test/estimation approach, Cox’s method provides an efficient and elegant way to summarize the magnitude of the treatment effect as a hazard ratio. It does not require a distribution assumption for survival time in each group. Also, it does not require estimation of the group-specific absolute hazard to calculate the hazard ratio; it can directly estimate the hazard ratio by imposing the proportional hazards assumption. However, this approach also has several limitations. One notable limitation is that, when the proportional hazards assumption does not hold, Cox’s hazard ratio depends on the underlying study-specific censoring time distribution, and therefore the interpretation of the resulting hazard ratio is not obvious (Kalbfleisch and Prentice 1981; Lin and Wei 1989; Uno et al. 2014; Horiguchi et al. 2019). Another notable limitation is the lack of group-specific absolute hazards. Although this may be an advantage from a statistical point of view, it can be a disadvantage from a practical point of view. The same hazard ratio will have a different clinical implication if the absolute hazard in the control group is different. Therefore, the group-specific absolute hazards will be necessary information for clinicians to assess if the resulting hazard ratio indicates a clinically significant magnitude of the treatment effect. In fact, some guidelines recommend presenting the magnitude of a treatment effect in both absolute difference and relative terms for better clinical interpretation. 
For example, in the CONSORT 2010 guideline, Subitem 17b says “When the primary outcome is binary, both the relative effect (risk ratio (relative risk) or odds ratio) and the absolute effect (risk difference) should be reported (with confidence intervals), as neither the relative measure nor the absolute measure alone gives a complete picture of the effect and its implications.” (Cobos-Carbo and Augustovski 2011) While Subitem 17b is specific for binary outcomes, this recommendation also applies to time-to-event outcomes. Similarly, the General Statistical Guidance provided by the Annals of Internal Medicine provides comparable guidance for time-to-event outcomes: “Presenting estimates of effect in both absolute and relative terms increases the likelihood that results will be correctly interpreted.” (Annals of Internal Medicine, 2025) Unfortunately, except for very specific cases, it is difficult to transform the hazard ratio from Cox’s proportional hazards model into the absolute difference in hazard.
There are many alternative approaches to address these limitations of the conventional approach. With respect to power loss under non-proportional hazards scenarios in the testing part of the conventional approach, one popular alternative is to use a test from a class of weighted log-rank tests. For example, for the delayed separation pattern shown in Fig. 1A, the \(G^{0, 1}\) test in the \(G^{\rho , \gamma }\) class (Fleming and Harrington 1991) will be more powerful than the ordinary log-rank test (i.e., \(G^{0, 0}\)). The piecewise weighted log-rank test proposed by Xu et al. (2017) is another alternative when a delayed separation pattern is expected. This test gives zero weight to the early time window \((0, \eta )\) where the two survival curves appear identical.
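To make the weighting idea concrete, the following minimal sketch evaluates the Fleming-Harrington weight, which in its standard form is \(\{\hat{S}(t-)\}^{\rho }\{1-\hat{S}(t-)\}^{\gamma }\) with \(\hat{S}\) the pooled Kaplan-Meier estimate. The function name and the grid of survival values are assumptions made only for illustration; no test statistic is computed here.

```python
import numpy as np

def fh_weight(s_minus, rho, gamma):
    """Fleming-Harrington G^{rho,gamma} weight at the pooled KM value S(t-)."""
    s_minus = np.asarray(s_minus, dtype=float)
    return s_minus**rho * (1.0 - s_minus) ** gamma

# Illustrative pooled survival values, from early (near 1) to late follow-up.
s_grid = np.array([1.0, 0.9, 0.7, 0.5, 0.3])

w_logrank = fh_weight(s_grid, 0, 0)  # G^{0,0}: constant weight (log-rank test)
w_late = fh_weight(s_grid, 0, 1)     # G^{0,1}: weight 1 - S(t-), small early on
print(w_logrank)
print(w_late)
```

With \(\rho = 0\) and \(\gamma = 1\), the weight increases as the pooled survival curve falls, which is why this choice emphasizes late differences.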
Although a delayed separation pattern has been observed in most immunotherapy trials, this may not always be the case. To address the possible misidentification of the pattern of difference between two survival curves, robust (or versatile) tests using several weighted log-rank test statistics have been proposed (Tarone and Ware 1977; Fleming and Harrington 1991) to capture various patterns of difference. For example, a cross-pharma working group (Roychoudhury et al. 2021) for clinical trials with non-proportional hazards recommended the so-called Max-combo test, which uses the maximum of the test statistics of \(G^{0,0},\) \(G^{1,0},\) \(G^{0,1}\) and \(G^{1,1}\) as the test statistic.
These alternatives using weighted log-rank tests will provide higher power than the conventional log-rank test in trials where a delayed treatment benefit is expected. However, one drawback of these tests concerns the estimation of the magnitude of the treatment effect. The treatment effect summary measure that corresponds to a weighted log-rank test is a hazard ratio-type measure (León et al. 2020) and thus has the same issues as Cox’s hazard ratio described above.
In addition to weighted log-rank approaches, a wide array of statistical methods has been developed to address non-proportional hazards; a comprehensive review of these methods is given by Bardo et al. (2024). In this paper, however, we focus on a specific class of alternative approaches that seek to overcome not only the limitation regarding non-proportional hazards but also the limitation regarding the lack of group-specific absolute hazards. Specifically, the class of alternative approaches we are interested in here uses summary measures of the event time distribution for each group. This allows the derivation of contrast measures that quantify between-group differences, encompassing both absolute and relative metrics (Uno et al. 2014; Chappell and Zhu 2016). For example, the cumulative incidence probability at a specific time point or the restricted mean survival time (RMST) with a specific truncation time point (Royston and Parmar 2011; Uno et al. 2014, 2015; A’Hern 2016; Chappell and Zhu 2016; Péron et al. 2016; Saad et al. 2018) can be used as the summary measure of the event time distribution. Median survival time can also be used if it is estimable in both groups. Of these, the approach using RMST is currently gaining the most attention. Randomized trials where the RMST-based approach was used for the analysis can be found in the literature (Guimarães et al. 2020; Connolly et al. 2022; Hammad et al. 2022; Sanchis et al. 2023). However, the RMST-based approach also has a limitation: the statistical comparison based on RMST provides lower power than the conventional log-rank test when detecting delayed treatment effects (Tian et al. 2018).
There are two directions one could take in order to address the power issue of the RMST-based approach under delayed difference patterns. Let S(t) be the survival function of the event time T. The standard RMST integrates the survival function with respect to time from time 0 to a specific truncation time point \(\tau ,\) which can be denoted by \(\int _0^{\tau } S(u)du.\) One direction to improve the power to detect a delayed treatment effect is to modify the time range in the calculation of the RMST as follows, \(\int _{\eta }^{\tau } S(u)du,\) where \(\eta \) will be a positive constant and near the time point where two survival curves start to separate. As shown in Fig. 1A, the two survival curves are almost identical for the first 6 to 7 months. Because \(\int _{0}^{\eta } S(u)du\) provides mainly noise rather than a signal for statistical comparison between two groups, using \(\int _{\eta }^{\tau } S(u)du\) will improve power. In this paper, to distinguish this from the standard RMST, we call this Long-Term RMST. A non-parametric inference procedure for this measure was given by Zhao et al. (2011), and an application to an immunotherapy study is found in Horiguchi et al. (2018). Recently, Paukner and Chappell (2021) also introduced this approach, calling it Window MST.
Another direction to address the power issue of the standard RMST approach in detecting delayed treatment effects is to modify the summary measure. Recently, Uno and Horiguchi (2023) proposed a new summary measure of the event time distribution called “average hazard with survival weight” and a nonparametric inference procedure for the difference and ratio of this metric. The average hazard with survival weight is defined as
$$\begin{aligned} \frac{\int _{0}^{\tau } h(u)S(u)du}{\int _{0}^{\tau } S(u)du} = \frac{1-S(\tau )}{\int _{0}^{\tau } S(u)du} = \frac{E\{I(T\le \tau )\}}{E\{ T \wedge \tau \}}, \end{aligned}$$
(1)
where h(u) is the hazard function for the event time T, I(A) is an indicator function for the event A, and \(x \wedge y \) denotes \(\min (x,y).\) Looking at the last term in Eq. (1), this quantity can also be viewed as the ratio of the expected total number of events observed by \(\tau \) to the expected total observation time by \(\tau \) when there is no censoring until \(\tau .\) In epidemiology textbooks, the person-time incidence rate is defined as the ratio of the total number of observed events to the total person-time of exposure (Rothman et al. 2008). When a Poisson model is true (that is, the event time follows an exponential distribution), this is the maximum likelihood estimator for the rate parameter of the Poisson model. However, if the Poisson model is not correct, it will not converge to a population quantity solely related to the event time distribution of T, but to one related to both the event time and the censoring time distributions (Uno and Horiguchi 2023). While the average hazard with survival weight defined by (1) is not a random quantity but a population parameter, it can be interpreted as the average person-time incidence rate of T on \(t \in [0,\tau ]\) when all T before \(\tau \) would have been observed without being censored by the study-specific censoring time. From this point forward, the average hazard with survival weight is simply referred to as the average hazard (AH). Numerical studies conducted by Uno and Horiguchi (2023) showed that AH-based tests can be more powerful than the log-rank test and the standard RMST-based tests in detecting delayed treatment effects, while they are less powerful for early difference scenarios.
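As a quick check on definition (1), consider the exponential case mentioned above, with constant hazard \(h(u)=\lambda \) (the symbol \(\lambda \) is introduced here only for illustration). Then \(S(u)=e^{-\lambda u}\) and
$$\begin{aligned} \frac{1-S(\tau )}{\int _{0}^{\tau } S(u)du} = \frac{1-e^{-\lambda \tau }}{\left( 1-e^{-\lambda \tau }\right) /\lambda } = \lambda , \end{aligned}$$
so that the AH recovers the rate parameter for every choice of \(\tau .\)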
In this paper, we seek to gain further power improvement in detecting a delayed treatment effect by combining these two directions of work. In Sect. 2, we propose a long-term average hazard (LT-AH) focusing on the intensity of event occurrence in a later study time window where the survival benefit of immunotherapy appears. We show nonparametric inference procedures for the ratio of LT-AH and the difference in LT-AH. In Sect. 3, we conduct numerical studies to assess the performance of the proposed approach in finite sample size situations, compared to other methods. In Sect. 4, we apply the proposed method to data from the aforementioned randomized study report that compared nivolumab plus ipilimumab with sunitinib in patients with advanced renal cell carcinoma (Motzer et al. 2018). Remarks and conclusions are given in Sect. 5.

2 Method

2.1 Long-term average hazard

Let \(T_k\) be a continuous non-negative random variable to denote the event time for group \(k \ (k=0,1)\). Let \(C_k\) denote the censoring time for group k. Assume that \(T_k\) is independent of \(C_k.\) Let \(\left\{ (T_{ki},C_{ki}); \ i=1,\ldots ,n_k \right\} \) denote independent copies from \((T_k, C_k).\) Let \(X_{ki} = \min (T_{ki}, C_{ki})\) and \(\varDelta _{ki}=I(T_{ki} \le C_{ki}).\) The observable data from group k are then denoted by \(\left\{ (X_{ki},\varDelta _{ki}); \ i=1,\ldots , n_k \right\} .\) We assume \(p_k = \lim _{n \rightarrow \infty } n_{k}/n > 0 \) for \(k=0,1,\) where \(n=n_1+n_0.\)
Let \(h_k(\cdot )\) and \(w_k(\cdot )\) be the hazard function for \(T_k\) and a continuous nonnegative function, respectively. A general form of the weighted average of the hazard function over a given time range \([\tau _{1}, \tau _{2}]\) \((0 \le \tau _{1} < \tau _{2})\) is denoted by
$$\begin{aligned} \eta _k(\tau _{1}, \tau _{2}) = \frac{ \int _{\tau _{1}}^{\tau _{2}} h_k(u)w_k(u) du}{\int _{\tau _{1}}^{\tau _{2}} w_k(u) du}. \end{aligned}$$
Here, we use the survival function \(S_k(t)\) for the weight function \(w_k(t)\) as proposed by Uno and Horiguchi (2023). We assume that \(F_k(\tau _{2})> F_k(\tau _{1}) \ge 0\) and \(R_k(\tau _{2})-R_k(\tau _{1})>0,\) for \(k=0,1,\) where \(F_k(\tau ) = 1-S_k(\tau )\) is the cumulative incidence probability at \(\tau \) and \(R_k(\tau ) = \int _{0}^{\tau }S_k(u) du\) is the RMST with the truncation time \(\tau .\) The average hazard with survival weight over the time range \([\tau _{1}, \tau _{2}]\) is then denoted by
$$\begin{aligned} & \eta _k(\tau _{1}, \tau _{2}) = \frac{ F_k(\tau _{2}) - F_k(\tau _{1}) }{R_k(\tau _{2}) - R_k(\tau _{1})} \nonumber \\ & \quad = \frac{ E\{ I(\tau _1 < T_k \le \tau _2) \} }{ E (T_k \wedge \tau _2) - E (T_k \wedge \tau _1)}. \end{aligned}$$
(2)
The numerator of the right-hand side of Eq. (2) is the probability of having an event in the time window \((\tau _1, \tau _2],\) and the denominator is the expected event-free time spent within that window. Thus, \(\eta _k(\tau _{1}, \tau _{2})\) can be interpreted as an average intensity of having an event over the time window \((\tau _1, \tau _2).\) In this paper, we call this quantity the long-term average hazard (LT-AH) hereafter to distinguish it from the AH.
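The population quantity in Eq. (2) can be evaluated numerically for any given survival function. The sketch below does so with a trapezoidal rule for two Weibull distributions, using the (shape, scale) values that Sect. 3.1 reports for the control group and for Pattern 2B, and assuming the parameterization \(S(t)=\exp \{-(t/\text {scale})^{\text {shape}}\}.\) The function names and grid size are illustrative assumptions.

```python
import numpy as np

def true_lt_ah(surv, tau1, tau2, n_grid=200_001):
    """LT-AH from Eq. (2): {S(tau1) - S(tau2)} / integral of S over [tau1, tau2]."""
    u = np.linspace(tau1, tau2, n_grid)
    s = surv(u)
    area = np.sum((s[1:] + s[:-1]) / 2.0 * np.diff(u))  # trapezoidal rule
    return (s[0] - s[-1]) / area

def weibull_surv(shape, scale):
    """Weibull survival function; shape 1 is the exponential distribution."""
    return lambda u: np.exp(-((u / scale) ** shape))

eta0 = true_lt_ah(weibull_surv(1.0, 10.0), 2.0, 10.0)   # control group
eta1 = true_lt_ah(weibull_surv(1.0, 12.5), 2.0, 10.0)   # Pattern 2B
print(eta1 - eta0, eta1 / eta0)
```

Because both distributions have a constant hazard (1/10 and 1/12.5), the LT-AH equals that hazard for any window, so the difference and ratio agree with the true values \(-\)0.020 and 0.800 listed for the PH difference pattern in Table 1.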
A natural non-parametric estimator for the LT-AH is
$$\begin{aligned} {\hat{\eta }}_k(\tau _{1}, \tau _{2}) = \frac{ {\hat{F}}_k(\tau _{2}) - {\hat{F}}_k(\tau _{1}) }{{\hat{R}}_k(\tau _{2}) - {\hat{R}}_k(\tau _{1})}, \end{aligned}$$
(3)
where \({\hat{F}}_k(\tau ) = 1- {\hat{S}}_k(\tau ),\) \({\hat{R}}_k(\tau ) = \int _0^{\tau } {\hat{S}}_k(u) du\), and \({\hat{S}}_k(\cdot )\) is the Kaplan-Meier estimator for \(S_k(\cdot ).\) Using the uniform consistency of the Kaplan-Meier estimator (Gill 1983), it can be shown that \({\hat{\eta }}_k(\tau _{1}, \tau _{2})\) converges in probability to \({\eta }_k(\tau _{1}, \tau _{2})\) as \(n_k\) goes to \(\infty .\) We consider
$$ \begin{aligned} Q_{k} = & n_{k}^{{1/2}} \left\{ {\log \hat{\eta }_{k} (\tau _{1} ,\tau _{2} ) - \log \eta _{k} (\tau _{1} ,\tau _{2} )} \right\} \\ = & n_{k}^{{1/2}} \left( {\log \left\{ {\hat{F}_{k} (\tau _{2} ) - \hat{F}_{k} (\tau _{1} )} \right\} - \log \left\{ {F_{k} (\tau _{2} ) - F_{k} (\tau _{1} )} \right\}} \right. \\ & \quad \left. { - \left[ {\log \left\{ {\hat{R}_{k} (\tau _{2} ) - \hat{R}_{k} (\tau _{1} )} \right\} - \log \left\{ {R_{k} (\tau _{2} ) - R_{k} (\tau _{1} )} \right\}} \right]} \right). \\ \end{aligned} $$
(4)
In Appendix A, we show that \(Q_k\) converges weakly to a normal distribution with mean 0 and variance
$$\begin{aligned} \begin{aligned} V(Q_k)&=\int _0^{\tau _{2}} \left\{ \frac{{S}_k(\tau _{2})-I(u\le \tau _{1}) {S}_k(\tau _{1}) }{{F}_k(\tau _{2})-{F}_k(\tau _{1})} \right. \\& \quad + \left. \frac{\int _u^{\tau _{2}} {S}_k(t)dt - I(u\le \tau _{1}) \int _u^{\tau _{1}} {S}_k(t)dt}{{R}_k(\tau _{2})-{R}_k(\tau _{1})} \right\} ^{2}\frac{dH_{k}(u)}{G_{k}(u)}, \end{aligned} \end{aligned}$$
where \(H_k(t)\) is the cumulative hazard function of \(T_{k}\) and \(G_k(t) = \Pr (T_k \wedge C_k \ge t).\) We can get an estimate for \(V(Q_k)\) by replacing the unknown quantities with their empirical counterparts, as shown below.
$$\begin{aligned} \begin{aligned} {\hat{V}}(Q_{k})&=\int _{0}^{\tau _{2}}\left\{ \frac{{\hat{S}}_{k}(\tau _{2})-I(u\le \tau _{1}){\hat{S}}_{k}(\tau _{1})}{{\hat{F}}_{k}(\tau _{2})-{\hat{F}}_{k}(\tau _{1})} \right. \\&+ \left. \frac{\int _{u}^{\tau _{2}}{\hat{S}}_{k}(t)dt-I(u\le \tau _{1})\int _{u}^{\tau _{1}}{\hat{S}}_{k}(t)dt}{{\hat{R}}_{k}(\tau _{2})-{\hat{R}}_{k}(\tau _{1})}\right\} ^{2}\frac{d{\hat{H}}_{k}(u)}{{\hat{G}}_{k}(u)}, \end{aligned} \end{aligned}$$
where \({\hat{G}}_k(t) = n^{-1}_k \sum _{i=1}^{n_k} I(X_{ki}\ge t),\) and \({\hat{H}}_k(\cdot )\) is the Nelson-Aalen estimator for \(H_k(\cdot )\) for group k. Using these results, a \((1-\alpha )\) asymptotic confidence interval (CI) for the LT-AH in group k is given by
$$\begin{aligned} \exp \left\{ \log {\hat{\eta }}_k(\tau _{1}, \tau _{2}) \pm z_{1-\alpha /2} \sqrt{{\hat{V}}(Q_k)/n_k} \right\} , \end{aligned}$$
where \(z_{1-\alpha /2}\) is the \((1-\alpha /2)\times 100 \)-percentile of the standard normal distribution.
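The estimator (3), the plug-in variance \({\hat{V}}(Q_k)\) and the log-scale CI can be sketched for a single group as follows. This is a minimal illustration written directly from the formulas above, not the authors' software; the function name and the toy data are assumptions. Since \(d{\hat{H}}_k\) jumps by (events)/(number at risk) at each observed event time and \({\hat{G}}_k\) is (number at risk)/\(n_k,\) the integral in \({\hat{V}}(Q_k)/n_k\) reduces to a finite sum.

```python
import numpy as np
from statistics import NormalDist

def lt_ah_ci(time, event, tau1, tau2, alpha=0.05):
    """Point estimate (3), V-hat(Q_k)/n_k, and the log-scale CI for one group."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)

    # Kaplan-Meier step function at the distinct observed event times.
    t_ev = np.unique(time[event])
    at_risk = np.array([(time >= t).sum() for t in t_ev])
    deaths = np.array([(time[event] == t).sum() for t in t_ev])
    surv = np.cumprod(1.0 - deaths / at_risk)

    def s_hat(t):
        idx = np.searchsorted(t_ev, t, side="right")
        return 1.0 if idx == 0 else surv[idx - 1]

    def r_hat(t):
        # Exact area under the KM step function on [0, t].
        knots = np.concatenate(([0.0], t_ev[t_ev < t], [t]))
        heights = np.concatenate(([1.0], surv[t_ev < t]))
        return float(np.sum(heights * np.diff(knots)))

    f_diff = s_hat(tau1) - s_hat(tau2)   # F-hat(tau2) - F-hat(tau1)
    r_diff = r_hat(tau2) - r_hat(tau1)   # R-hat(tau2) - R-hat(tau1)
    eta = f_diff / r_diff

    # Sum of integrand^2 * d / Y^2 over event times u <= tau2.
    var_log = 0.0
    for u, y, d in zip(t_ev, at_risk, deaths):
        if u > tau2:
            break
        early = 1.0 if u <= tau1 else 0.0
        term_f = (s_hat(tau2) - early * s_hat(tau1)) / f_diff
        term_r = ((r_hat(tau2) - r_hat(u))
                  - early * (r_hat(tau1) - r_hat(u))) / r_diff
        var_log += (term_f + term_r) ** 2 * d / y**2

    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * np.sqrt(var_log)
    return eta, var_log, (eta * np.exp(-half), eta * np.exp(half))

# Toy data (illustrative only): four uncensored event times.
eta, var_log, ci = lt_ah_ci([1, 2, 3, 4], [1, 1, 1, 1], tau1=1.5, tau2=3.5)
print(eta, var_log, ci)
```

For \(u \le \tau _1\) the two terms of the integrand cancel (the first equals \(-1\) and the second equals 1), so only event times in \((\tau _1, \tau _2]\) contribute to the variance.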

2.2 Between-group contrast measures derived from LT-AH

Using the group-specific LT-AH’s from two groups, we can summarize the magnitude of the treatment effect in both absolute difference and relative terms. In this section, we describe the results regarding the inference of the ratio of LT-AH and the difference in LT-AH.

2.2.1 Ratio of LT-AH

We estimate the ratio of LT-AH
$$\begin{aligned} \theta (\tau _{1}, \tau _{2}) = \frac{ \eta _1(\tau _{1}, \tau _{2}) }{\eta _0(\tau _{1}, \tau _{2})} = \frac{ F_1(\tau _{2})-F_1(\tau _{1}) }{F_0(\tau _{2})-F_0(\tau _{1}) } \frac{ R_0(\tau _{2})-R_0(\tau _{1}) }{R_1(\tau _{2})-R_1(\tau _{1}) } \end{aligned}$$
by
$$\begin{aligned} {\hat{\theta }}(\tau _{1}, \tau _{2}) = \frac{{\hat{\eta }}_1(\tau _{1}, \tau _{2})}{{\hat{\eta }}_0(\tau _{1}, \tau _{2})}. \end{aligned}$$
For the inference of \(\theta (\tau _{1}, \tau _{2})\), we consider the asymptotic distribution of
$$\begin{aligned} \begin{aligned}&n^{1/2} \left\{ \log {\hat{\theta }}(\tau _{1}, \tau _{2}) - \log {\theta }(\tau _{1}, \tau _{2}) \right\} \\&\quad = n^{1/2} \left\{ \log {\hat{\eta }}_{1}(\tau _{1}, \tau _{2}) - \log {\eta }_{1}(\tau _{1}, \tau _{2}) \right\} - n^{1/2} \left\{ \log {\hat{\eta }}_{0}(\tau _{1}, \tau _{2}) - \log {\eta }_{0}(\tau _{1}, \tau _{2}) \right\} . \end{aligned} \end{aligned}$$
From the results regarding \(Q_k\) described in Section 2.1, it is shown that \(n^{1/2} \left\{ \log {\hat{\theta }}(\tau _{1}, \tau _{2}) - \log {\theta }(\tau _{1}, \tau _{2}) \right\} \) converges weakly to a normal distribution with mean zero and variance \( p^{-1}_1{V}(Q_1) + p^{-1}_0{V}(Q_0).\) The variance can be estimated by \({\hat{p}}_1^{-1}{\hat{V}}(Q_1) + {\hat{p}}_0^{-1}{\hat{V}}(Q_0),\) where \({\hat{p}}_k = n_k/n,\) for \(k=0,1.\) Therefore, a \((1-\alpha )\) asymptotic CI for \(\theta (\tau _{1}, \tau _{2})\) is
$$\begin{aligned} \exp \left\{ \log {\hat{\theta }}(\tau _{1}, \tau _{2}) \pm z_{1-\alpha /2} \sqrt{n_1^{-1}{\hat{V}}(Q_1) + n_0^{-1}{\hat{V}}(Q_0)} \right\} . \end{aligned}$$
(5)
For testing the null hypothesis, \(\theta (\tau _{1}, \tau _{2}) = 1,\)
$$\begin{aligned} \log {\hat{\theta }}(\tau _{1}, \tau _{2})/ \sqrt{n_1^{-1}{\hat{V}}(Q_1) + n_0^{-1}{\hat{V}}(Q_0)} \end{aligned}$$
(6)
is used as the test statistic, which asymptotically follows the standard normal distribution under the null hypothesis.
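Given the group-specific estimates, the CI (5) and the test statistic (6) are simple to compute. In the sketch below, the arguments `var_log1` and `var_log0` stand for \({\hat{V}}(Q_k)/n_k\); the function name and the numerical inputs are hypothetical and serve only to exercise the formulas.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def lt_ah_ratio_inference(eta1, var_log1, eta0, var_log0, alpha=0.05):
    """CI (5) and test statistic (6) for the ratio of LT-AH."""
    theta = eta1 / eta0
    se_log = sqrt(var_log1 + var_log0)          # SE of log(theta-hat)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    ci = (exp(log(theta) - z_crit * se_log),
          exp(log(theta) + z_crit * se_log))
    z_stat = log(theta) / se_log                # statistic (6)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    return theta, ci, z_stat, p_value

# Hypothetical inputs, not taken from any study.
theta, ci, z_stat, p_value = lt_ah_ratio_inference(0.075, 0.02, 0.100, 0.015)
print(theta, ci, z_stat, p_value)
```

Because the interval and the test share the same standard error on the log scale, the CI excludes 1 exactly when the two-sided test rejects at level \(\alpha ,\) which is the test/estimation coherency discussed in Sect. 1.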

2.2.2 Difference in LT-AH

We estimate the difference in LT-AH
$$\begin{aligned} \xi (\tau _{1}, \tau _{2}) = \eta _1(\tau _{1}, \tau _{2}) - \eta _0(\tau _{1}, \tau _{2}) = \frac{F_1(\tau _{2})-F_1(\tau _{1})}{R_1(\tau _{2})-R_1(\tau _{1})}-\frac{F_0(\tau _{2})-F_0(\tau _{1})}{R_0(\tau _{2})-R_0(\tau _{1})} \end{aligned}$$
by
$$\begin{aligned} {\hat{\xi }}(\tau _{1}, \tau _{2})={\hat{\eta }}_1(\tau _{1}, \tau _{2}) - {\hat{\eta }}_0(\tau _{1}, \tau _{2}). \end{aligned}$$
For hypothesis testing and interval estimation, we consider the following asymptotic distribution
$$\begin{aligned} \begin{aligned}&n^{1/2} \left\{ {\hat{\xi }}(\tau _{1}, \tau _{2}) - {\xi }(\tau _{1}, \tau _{2}) \right\} \\&\quad = n^{1/2} \left\{ {\hat{\eta }}_{1}(\tau _{1}, \tau _{2}) - {\eta }_{1}(\tau _{1}, \tau _{2}) \right\} - n^{1/2} \left\{ {\hat{\eta }}_{0}(\tau _{1}, \tau _{2}) - {\eta }_{0}(\tau _{1}, \tau _{2}) \right\} . \end{aligned} \end{aligned}$$
(7)
In Appendix B, it is shown that
$$\begin{aligned} \begin{aligned} U_k&= n_k^{1/2} \left\{ {\hat{\eta }}_{k}(\tau _{1}, \tau _{2}) - {\eta }_{k}(\tau _{1}, \tau _{2}) \right\} \\&= n_k^{1/2} \left\{ \frac{ {\hat{F}}_k(\tau _{2}) - {\hat{F}}_k(\tau _{1}) }{ {\hat{R}}_k(\tau _{2}) - {\hat{R}}_k(\tau _{1})} - \frac{ F_k(\tau _{2}) - F_k(\tau _{1}) }{R_k(\tau _{2}) - R_k(\tau _{1})}\right\} \end{aligned} \end{aligned}$$
(8)
converges weakly to a normal distribution with mean 0 and variance
$$\begin{aligned} V(U_{k} ) = & \int_{0}^{{\tau _{2} }} {\left[ {\frac{{S_{k} (\tau _{2} ) - I(u \le \tau _{1} )S_{k} (\tau _{1} )}}{{R_{k} (\tau _{2} ) - R_{k} (\tau _{1} )}}} \right.} \\ & \quad + \left. {\frac{{F_{k} (\tau _{2} ) - F_{k} (\tau _{1} )}}{{\left\{ {R_{k} (\tau _{2} ) - R_{k} (\tau _{1} )} \right\}^{2} }}\left\{ {\int_{u}^{{\tau _{2} }} {S_{k} } (t)dt - I(u \le \tau _{1} )\int_{u}^{{\tau _{1} }} {S_{k} } (t)dt} \right\}} \right]^{2} \frac{{dH_{k} (u)}}{{G_{k} (u)}}. \\ \end{aligned} $$
Applying this result to Eq. (7), it is shown that \(n^{1/2} \left\{ {\hat{\xi }}(\tau _1,\tau _2) - {\xi }(\tau _1,\tau _2) \right\} \) converges weakly to a normal distribution with mean 0 and variance \( p^{-1}_1{V}(U_1) + p^{-1}_0{V}(U_0)\). This can be estimated by replacing the unknown quantities with their empirical counterparts, \( {\hat{p}}^{-1}_1{\hat{V}}(U_1) + {\hat{p}}^{-1}_0{\hat{V}}(U_0)\). Therefore, a \((1-\alpha )\) asymptotic CI for \(\xi (\tau _1,\tau _2)\) is given by
$$\begin{aligned} {\hat{\xi }}(\tau _1,\tau _2) \pm z_{1-\alpha /2} \sqrt{n^{-1}_1{\hat{V}}(U_1) + n^{-1}_0{\hat{V}}(U_0)}. \end{aligned}$$
(9)
For testing the null hypothesis that there is no difference in LT-AH with time window (\(\tau _{1}\),\(\tau _{2}\)) between two groups (i.e., \(\xi (\tau _1,\tau _2)=0\)), we will use
$$\begin{aligned} {\hat{\xi }}(\tau _1,\tau _2) / \sqrt{n^{-1}_1{\hat{V}}(U_1) + n^{-1}_0{\hat{V}}(U_0)} \end{aligned}$$
(10)
as the test statistic, which asymptotically follows the standard normal distribution under the null.
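The CI (9) and the statistic (10) can be sketched in the same way. Comparing the two variance formulas term by term, the integrand of \(V(U_k)\) equals \(\eta _k(\tau _1,\tau _2)\) times the integrand of \(V(Q_k),\) so that \(V(U_k)=\eta _k(\tau _1,\tau _2)^2 V(Q_k),\) and the same identity holds for the plug-in estimates. The sketch uses this identity so that it needs only the group-specific estimates and \({\hat{V}}(Q_k)/n_k;\) the function name and inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def lt_ah_diff_inference(eta1, var_log1, eta0, var_log0, alpha=0.05):
    """CI (9) and test statistic (10) for the difference in LT-AH.

    var_logk stands for V-hat(Q_k) / n_k; V-hat(U_k) / n_k is obtained as
    eta_k^2 * var_logk (the identity stated in the lead-in).
    """
    xi = eta1 - eta0
    se = sqrt(eta1**2 * var_log1 + eta0**2 * var_log0)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    ci = (xi - z_crit * se, xi + z_crit * se)
    z_stat = xi / se                              # statistic (10)
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z_stat)))
    return xi, se, ci, z_stat, p_value

# Hypothetical inputs, not taken from any study.
xi, se, ci, z_stat, p_value = lt_ah_diff_inference(0.075, 0.02, 0.100, 0.015)
print(xi, se, ci, z_stat, p_value)
```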

3 Numerical studies

We performed extensive numerical studies and evaluated finite sample properties of the proposed asymptotic 0.95 CIs for difference in LT-AH and ratio of LT-AH and asymptotic tests for no treatment effect (i.e., \(\xi (\tau _1,\tau _2)=0\) and \(\theta (\tau _1,\tau _2)=1,\) respectively).

3.1 Configurations

Figure 2 shows five different patterns for the event time distribution between two groups considered in our numerical studies. Pattern 2A denotes a no difference scenario. This was used to confirm that the sizes of the tests we examined in the numerical studies were at the conventional nominal level (i.e., 5.0%). Pattern 2B denotes a proportional hazards difference. For delayed difference scenarios, we considered three patterns: 2C to 2E. Specifically, the two survival curves were identical up to time 2 and then, toward the end of follow-up, were (I) diverging in Pattern 2C, (II) almost parallel in Pattern 2D, and (III) converging in Pattern 2E. For all patterns, the event time in group 0, \(T_0,\) was generated from the Weibull distribution with parameters (shape, scale)=(1,10). For Pattern 2B, the event time in group 1, \(T_1,\) was generated from the Weibull distribution with parameters (shape, scale)=(1, 12.5). For Patterns 2C to 2E, we first generated a random number, U, from a standard uniform distribution. We then transformed it to the event time in group 1 by \(T_{1} = S_{1}^{-1}(U),\) where \(S_{1}(t)\) is the survival function in group 1 as presented with the solid line in Fig. 2C–E, respectively.
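The inverse transform step \(T_{1} = S_{1}^{-1}(U)\) can be illustrated with a piecewise exponential survival function. The curve used below is a hypothetical stand-in (hazard 0.10 up to time 2, then 0.075), chosen only because its inverse has a closed form; it is not claimed to be one of the curves in Fig. 2C–E, whose functional forms are not given in the text.

```python
import numpy as np

# Hypothetical delayed-effect curve (NOT the curves of Fig. 2):
# hazard 0.10 until time 2, as in the control group, then 0.075 afterwards.
H_EARLY, H_LATE, T_SEP = 0.10, 0.075, 2.0

def s1(t):
    """Survival function S_1(t) of the hypothetical treatment group."""
    t = np.asarray(t, dtype=float)
    cum_haz = np.where(t <= T_SEP, H_EARLY * t,
                       H_EARLY * T_SEP + H_LATE * (t - T_SEP))
    return np.exp(-cum_haz)

def s1_inverse(u):
    """Solve S_1(t) = u for t, with u in (0, 1]."""
    cum_haz = -np.log(np.asarray(u, dtype=float))
    return np.where(cum_haz <= H_EARLY * T_SEP, cum_haz / H_EARLY,
                    T_SEP + (cum_haz - H_EARLY * T_SEP) / H_LATE)

rng = np.random.default_rng(2025)
u = 1.0 - rng.random(1000)   # uniform on (0, 1], avoids log(0)
t1 = s1_inverse(u)           # event times whose survival function is s1
```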
Figure 3 shows three patterns of censoring time distributions considered in our numerical studies: no censoring pattern (3A), light censoring pattern (3B), and moderate censoring pattern (3C). We used the Weibull distributions with parameters (shape, scale)=(3.871, 14.189) for Pattern (3B), and (shape, scale)=(2.818, 10.233) for Pattern (3C). The administrative censoring at time 10 was applied to all three censoring time distribution patterns. The censoring time distribution was not group-specific but study-specific in our numerical studies. As such, a total of 15 combinations of difference in the event time distribution and censoring time distribution were investigated.
From each of the 15 combinations, we generated \(n_k\) pairs of data points for each group \(\{(T_{ki},C_{ki}); \ k=0, 1,\ i=1,\ldots ,n_{k}\}\) independently, where the censoring time was independent of the event time. We then derived the observable data \(\left\{ (X_{ki},\varDelta _{ki});\ k=0,1,\ i=1,\ldots ,n_{k}\right\} \), where \(X_{ki}=T_{ki} \wedge C_{ki}\) and \(\varDelta _{ki}=I(T_{ki}\le C_{ki}).\) We then calculated the difference in LT-AH and the ratio of LT-AH, and the corresponding 0.95 CIs, using (9) and (5), respectively. The asymptotic tests based on the difference in LT-AH and the ratio of LT-AH were also performed, using the test statistics (10) and (6), respectively, at the two-sided 0.05 \(\alpha \) level.
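One replicate of this data-generating step might look as follows for the proportional hazards pattern (2B) under light censoring (3B). This is a sketch under the assumption that the Weibull distribution is parameterized as \(S(t)=\exp \{-(t/\text {scale})^{\text {shape}}\};\) the seed, helper names and the choice of scenario are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n_k = 100
ADMIN_CENSOR = 10.0

def weibull(shape, scale, size):
    # NumPy draws the standard (scale 1) Weibull, so rescale the draws.
    return scale * rng.weibull(shape, size)

def observe(t, c):
    """Observable pair (X, Delta) from event times t and censoring times c."""
    x = np.minimum(t, c)
    delta = (t <= c).astype(int)
    return x, delta

# Event times: control group and the proportional hazards Pattern 2B.
t0 = weibull(1.0, 10.0, n_k)
t1 = weibull(1.0, 12.5, n_k)

# Light censoring (Pattern 3B) combined with administrative censoring at 10;
# the same censoring distribution is used in both groups.
c0 = np.minimum(weibull(3.871, 14.189, n_k), ADMIN_CENSOR)
c1 = np.minimum(weibull(3.871, 14.189, n_k), ADMIN_CENSOR)

x0, d0 = observe(t0, c0)
x1, d1 = observe(t1, c1)
```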
Repeating this process 5,000 times, we assessed the empirical bias of the proposed point estimators for difference in LT-AH and ratio of LT-AH, the average of the asymptotic standard error, the empirical standard error, the empirical coverage probability of the 0.95 CIs, and the average length of 0.95 CIs. We also evaluated the empirical size and power of the asymptotic tests. As comparators, we included the score test based on Cox’s method, and tests based on the standard RMST, LT-RMST, and AH. For the standard RMST and AH, the time window [0, 10] was considered. The time ranges [1, 10], [2, 10] and [3, 10] were used for the analysis based on LT-AH and LT-RMST. We considered three sample sizes for each scenario: \(n_{k}=100,\) 200, and 300 \((k=0,1).\)
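The operating characteristics listed above can be computed from the replicate-level results with a small helper such as the one below. It assumes a Wald-type interval (estimate plus or minus a normal quantile times the standard error), as in (9) for the difference in LT-AH; for the ratio, the same helper would apply to the log-transformed estimates, with the interval length then needing to be evaluated after back-transformation. The function name, the use of the sample standard deviation with divisor \(N-1,\) and the toy inputs are assumptions.

```python
import numpy as np
from statistics import NormalDist

def summarize_simulation(estimates, std_errors, truth, level=0.95):
    """Bias, ASE, ESE, coverage and average length over simulation replicates."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    lower, upper = est - z * se, est + z * se
    return {
        "bias": float(est.mean() - truth),       # empirical bias
        "ASE": float(se.mean()),                 # average asymptotic SE
        "ESE": float(est.std(ddof=1)),           # SD of the point estimates
        "CP": float(np.mean((lower <= truth) & (truth <= upper))),
        "AL": float(np.mean(upper - lower)),     # average CI length
    }

# Toy input with two replicates (illustrative only).
summary = summarize_simulation([0.9, 1.1], [0.1, 0.1], truth=1.0)
print(summary)
```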

3.2 Results

Table 1 summarizes the performance of the difference in LT-AH and ratio of LT-AH with a sample size of 100 per arm. Regarding the difference in LT-AH, the absolute value of the empirical bias of the proposed estimator was less than 0.0005 in all scenarios. The average of the asymptotic standard error (ASE) was almost identical to the empirical standard error (ESE) across all scenarios, indicating that the asymptotic SE provided a reliable approximation of the true variability. The empirical coverage probabilities were close to the nominal level of 0.95 (ranging from 0.945 to 0.953). The average length of the CI increased slightly as the censoring increased, as expected. The results of bias and coverage with sample sizes of 200 and 300 per arm were almost identical to those of Table 1 (Supplementary Tables 1 and 2).
Table 1 Performance of the difference and ratio of long-term average hazard with survival weight with sample size 100 per arm and the time range [2, 10]

| Event time distribution pattern | Censoring | Difference in LT-AH: True | Difference: Bias | Difference: \(\text {ASE}^{*^1}\) | Difference: \(\text {ESE}^{*^2}\) | Difference: CP | Difference: AL | Ratio of LT-AH: True | Ratio: Bias | Ratio: \(\text {ASE}^{*^3}\) | Ratio: \(\text {ESE}^{*^4}\) | Ratio: CP | Ratio: AL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| No difference | None | 0.000 | \(-\)0.000 | 0.021 | 0.021 | 0.953 | 0.082 | 1.000 | 0.021 | 0.210 | 0.212 | 0.951 | 0.865 |
| No difference | Light | 0.000 | 0.000 | 0.022 | 0.022 | 0.953 | 0.085 | 1.000 | 0.027 | 0.216 | 0.217 | 0.951 | 0.894 |
| No difference | Moderate | 0.000 | 0.000 | 0.024 | 0.024 | 0.948 | 0.094 | 1.000 | 0.034 | 0.241 | 0.246 | 0.948 | 1.015 |
| PH difference | None | \(-\)0.020 | \(-\)0.000 | 0.019 | 0.020 | 0.953 | 0.076 | 0.800 | 0.016 | 0.216 | 0.218 | 0.950 | 0.712 |
| PH difference | Light | \(-\)0.020 | 0.000 | 0.020 | 0.020 | 0.949 | 0.078 | 0.800 | 0.021 | 0.222 | 0.226 | 0.947 | 0.737 |
| PH difference | Moderate | \(-\)0.020 | 0.000 | 0.022 | 0.023 | 0.946 | 0.088 | 0.800 | 0.027 | 0.249 | 0.254 | 0.948 | 0.840 |
| Delayed difference I | None | \(-\)0.025 | \(-\)0.000 | 0.019 | 0.019 | 0.952 | 0.076 | 0.750 | 0.015 | 0.222 | 0.224 | 0.952 | 0.684 |
| Delayed difference I | Light | \(-\)0.025 | 0.000 | 0.020 | 0.020 | 0.952 | 0.078 | 0.750 | 0.019 | 0.228 | 0.232 | 0.949 | 0.709 |
| Delayed difference I | Moderate | \(-\)0.025 | 0.000 | 0.022 | 0.022 | 0.947 | 0.087 | 0.750 | 0.026 | 0.256 | 0.261 | 0.949 | 0.808 |
| Delayed difference II | None | \(-\)0.024 | \(-\)0.000 | 0.019 | 0.019 | 0.951 | 0.074 | 0.762 | 0.015 | 0.213 | 0.215 | 0.951 | 0.666 |
| Delayed difference II | Light | \(-\)0.024 | \(-\)0.000 | 0.019 | 0.020 | 0.949 | 0.076 | 0.762 | 0.019 | 0.220 | 0.224 | 0.948 | 0.693 |
| Delayed difference II | Moderate | \(-\)0.024 | 0.000 | 0.022 | 0.022 | 0.947 | 0.086 | 0.762 | 0.025 | 0.250 | 0.255 | 0.946 | 0.800 |
| Delayed difference III | None | \(-\)0.021 | \(-\)0.000 | 0.019 | 0.019 | 0.949 | 0.073 | 0.791 | 0.015 | 0.207 | 0.209 | 0.951 | 0.673 |
| Delayed difference III | Light | \(-\)0.021 | \(-\)0.000 | 0.019 | 0.020 | 0.949 | 0.076 | 0.791 | 0.019 | 0.215 | 0.217 | 0.950 | 0.701 |
| Delayed difference III | Moderate | \(-\)0.021 | 0.000 | 0.022 | 0.022 | 0.945 | 0.086 | 0.791 | 0.025 | 0.245 | 0.250 | 0.947 | 0.814 |

Event time distribution pattern: no difference, Fig. 2A; PH difference, Fig. 2B; delayed difference I, Fig. 2C; delayed difference II, Fig. 2D; delayed difference III, Fig. 2E. Censoring time distribution pattern: no censoring, Fig. 3A; light censoring, Fig. 3B; moderate censoring, Fig. 3C. \(^{*^1}\): Average of the asymptotic standard error for the difference in LT-AH over 5000 iterations. \(^{*^2}\): Standard deviation of the point estimates for the difference in LT-AH derived from 5000 iterations. \(^{*^3}\): Average of the asymptotic standard error for the log-transformed ratio of LT-AH over 5000 iterations. \(^{*^4}\): Standard deviation of the point estimates for the log-transformed ratio of LT-AH derived from 5000 iterations. Abbreviations: LT-AH, long-term average hazard; True, the true value; Bias, the empirical bias (estimate minus true value); ASE, the average standard error; ESE, the empirical standard error; CP, the empirical coverage probability of the 0.95 confidence interval; AL, the average length of the 0.95 confidence interval; Censoring, censoring time distribution pattern; PH, proportional hazards
Table 2
Size and power of tests based on Cox’s hazard ratio, difference in average hazard, difference in long-term average hazard, difference in restricted mean survival time, and difference in long-term restricted mean survival time with sample size 200 per arm and various time ranges [\(\tau _1\), \(\tau _2\)]
All columns except Cox’s HR refer to the test of the difference in the stated metric over the stated time range.

| Measure | Event time distribution pattern | Censoring | Cox’s HR | AH [0,10] | LT-AH [1,10] | LT-AH [2,10] | LT-AH [3,10] | RMST [0,10] | LT-RMST [1,10] | LT-RMST [2,10] | LT-RMST [3,10] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Size | No difference | None | 0.047 | 0.047 | 0.047 | 0.044 | 0.047 | 0.048 | 0.050 | 0.048 | 0.048 |
| Size | No difference | Light | 0.052 | 0.052 | 0.047 | 0.048 | 0.051 | 0.052 | 0.054 | 0.055 | 0.054 |
| Size | No difference | Moderate | 0.049 | 0.047 | 0.049 | 0.049 | 0.052 | 0.051 | 0.052 | 0.052 | 0.051 |
| Power | PH difference | None | 0.403 | 0.402 | 0.353 | 0.308 | 0.261 | 0.355 | 0.365 | 0.370 | 0.376 |
| Power | PH difference | Light | 0.386 | 0.383 | 0.330 | 0.291 | 0.245 | 0.352 | 0.360 | 0.362 | 0.368 |
| Power | PH difference | Moderate | 0.350 | 0.336 | 0.284 | 0.236 | 0.197 | 0.339 | 0.347 | 0.348 | 0.349 |
| Power | Delayed difference I | None | 0.333 | 0.337 | 0.386 | 0.453 | 0.385 | 0.217 | 0.228 | 0.247 | 0.275 |
| Power | Delayed difference I | Light | 0.316 | 0.325 | 0.363 | 0.429 | 0.365 | 0.215 | 0.227 | 0.246 | 0.272 |
| Power | Delayed difference I | Moderate | 0.254 | 0.281 | 0.313 | 0.347 | 0.290 | 0.202 | 0.213 | 0.235 | 0.261 |
| Power | Delayed difference II | None | 0.320 | 0.322 | 0.371 | 0.434 | 0.223 | 0.362 | 0.383 | 0.425 | 0.460 |
| Power | Delayed difference II | Light | 0.332 | 0.313 | 0.350 | 0.407 | 0.210 | 0.360 | 0.378 | 0.417 | 0.450 |
| Power | Delayed difference II | Moderate | 0.339 | 0.265 | 0.295 | 0.323 | 0.169 | 0.347 | 0.366 | 0.398 | 0.427 |
| Power | Delayed difference III | None | 0.262 | 0.258 | 0.299 | 0.352 | 0.127 | 0.417 | 0.438 | 0.485 | 0.521 |
| Power | Delayed difference III | Light | 0.288 | 0.250 | 0.284 | 0.330 | 0.123 | 0.409 | 0.429 | 0.472 | 0.507 |
| Power | Delayed difference III | Moderate | 0.335 | 0.214 | 0.234 | 0.253 | 0.102 | 0.395 | 0.413 | 0.454 | 0.484 |

Event time distribution pattern: no difference, Fig. 2A; PH difference, Fig. 2B; delayed difference I, Fig. 2C; delayed difference II, Fig. 2D; delayed difference III, Fig. 2E. Censoring time distribution pattern: no censoring, Fig. 3A; light censoring, Fig. 3B; moderate censoring, Fig. 3C. Abbreviations: HR, hazard ratio; AH, average hazard; LT-AH [\(\tau _1\), \(\tau _2\)], long-term average hazard for the time range [\(\tau _1\), \(\tau _2\)]; RMST, restricted mean survival time; LT-RMST [\(\tau _1\), \(\tau _2\)], long-term restricted mean survival time for the time range [\(\tau _1\), \(\tau _2\)]; Censoring, censoring time distribution pattern; PH, proportional hazards
Regarding the ratio of LT-AH, the empirical bias ranged from 0.015 to 0.034 with sample size 100 per arm (Table 1). The bias with moderate censoring was greater than that with no or light censoring. The ASE was nearly identical to the ESE across all scenarios, confirming the accuracy of the asymptotic variance approximation. The empirical coverage probabilities were close to the nominal level (ranging from 0.946 to 0.952). As the sample size increased, the empirical bias decreased. Specifically, the bias ranged from 0.006 to 0.017 with sample size 200 per arm and from 0.004 to 0.012 with sample size 300 per arm. Coverage probabilities with sample sizes 200 and 300 per arm were almost identical to those with sample size 100 per arm (Supplementary Tables 1 and 2).
The results of one-sample estimations (i.e., inference of LT-AH for Group 0 and Group 1) with sample sizes of 100, 200, and 300 are presented in Supplementary Tables 3, 4, and 5, respectively. These results also demonstrate satisfactory performance, supporting the validity of the inference procedures.
The size and power of the asymptotic tests with a sample size of 200 per arm are summarized in Table 2. Since the number of iterations was 5000 and the nominal type I error rate is 5.0%, the empirical type I error rate should fall between 4.4% and 5.6% with 95% probability. Under the no-difference pattern (Fig. 2A), the empirical type I error rates of all tests were within this range for all censoring patterns.
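The 4.4% to 5.6% range quoted above follows from the binomial Monte Carlo error of a rejection proportion. A minimal sketch, assuming a normal approximation to the binomial distribution (the function name `monte_carlo_band` is illustrative only):

```python
import math

def monte_carlo_band(alpha=0.05, n_iter=5000, z=1.959964):
    """Range in which an empirical rejection rate should fall with ~95% chance."""
    se = math.sqrt(alpha * (1.0 - alpha) / n_iter)  # binomial standard error
    return alpha - z * se, alpha + z * se

lo, hi = monte_carlo_band()
print(f"{lo:.3f} to {hi:.3f}")  # prints "0.044 to 0.056"
```

The standard error here is about 0.0031, so the half-width of the band is about 0.006.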
The power of each test is also summarized in Table 2. Under the proportional hazards difference pattern (Fig. 2B), Cox’s hazard ratio test showed the highest power, as expected from theory. The test based on the AH difference showed power comparable to Cox’s hazard ratio test, followed by the LT-RMST-, RMST-, and LT-AH-based tests. Notably, the LT-RMST-based test exhibited higher power than the RMST-based test. This can be attributed to the fact that, for the RMST-based test, the signal (i.e., the difference in RMST) is small relative to the noise contributed by the early time window. The power of the LT-RMST-based test would decrease if a much larger \(\tau _1\) were selected. On the other hand, including the early time range improved the power of the LT-AH-based test. Unlike RMST and LT-RMST, where the signal is measured by differences in area under survival curves, the LT-AH approach quantifies the signal through differences in hazard. Although the difference in survival probability appears to be minimal at early time points (Fig. 2B), the difference in hazard is constant from time 0 in this scenario, contributing to the power of the LT-AH-based test. For the same reason, the LT-AH-based tests exhibited lower power than the AH-based test under the proportional hazards scenario. The same trends were observed across the three censoring patterns. For each test, the power decreased as censoring increased.
For the delayed difference scenarios (Fig. 2C–E), the long-term versions of the AH-based and RMST-based tests showed higher power than their respective counterparts in all three delayed difference patterns from I to III, which supported using the long-term version when a delayed difference is expected and the true separation time point (i.e., \(\tau _1=2\) in this case) is specified. However, the choice between the LT-AH-based test and the LT-RMST-based test may depend on the delayed difference pattern. Specifically, in the delayed difference pattern I (i.e., a proportional hazards difference after separation; see Fig. 2C), the LT-AH-based test was more powerful than the LT-RMST-based test. However, the results were the opposite in the delayed difference pattern III (i.e., two survival curves do not diverge but converge after separation; see Fig. 2E). In the delayed difference pattern II (i.e., the two survival curves appear parallel after the separation point; see Fig. 2D), the LT-AH-based and LT-RMST-based tests showed similar power, but the superiority of these two tests also depended on the censoring pattern.
These studies also showed that the LT-AH-based test is superior to Cox’s hazard ratio test in delayed difference patterns when censoring is none or light. For example, under light censoring, the power of Cox’s hazard ratio test versus the LT-AH-based test with the time window of [2, 10] was (0.316 vs. 0.429), (0.332 vs. 0.407), and (0.288 vs. 0.330) for the delayed difference pattern I to III, respectively. On the other hand, the superiority of the LT-RMST-based test over Cox’s hazard ratio test depended on the pattern of delayed difference. Specifically, Cox’s hazard ratio test was more powerful than the LT-RMST-based test for the delayed difference pattern I (0.333 vs. 0.247; no censoring), but the results were the opposite in the delayed difference patterns II (0.320 vs. 0.425; no censoring) and III (0.262 vs. 0.485; no censoring).
We also evaluated the power of the proposed tests when the separation time point was not correctly specified under the delayed separation patterns (Fig. 2C–E). As expected, compared to the tests based on the optimal time window [2,10], the LT-AH-based tests using a time window of [3,10] exhibited a substantial reduction in power. For example, in the delayed difference pattern III (Fig. 2E) with no censoring, the power of the LT-AH test was 0.352 with the time window [2,10] but decreased to 0.127 with [3,10]. In contrast, using a broader time window [1,10] resulted in a power of 0.299. These results suggest that users should be cautious about excluding the early time range and, when in doubt, should favor a smaller \(\tau _1,\) as the power loss from choosing a \(\tau _1\) that is too small is generally less severe than the loss from selecting a \(\tau _1\) that is too large.
We carried out power comparisons using ratios of AH, LT-AH, RMST and LT-RMST in parallel to the differences in these metrics, but we did not include the results here because they were almost identical. Also, we had the same findings from the results with sample sizes 100 and 300 per arm (Supplementary Tables 6 and 7).
In summary, when delayed onset of treatment benefit is expected, LT-AH or LT-RMST is recommended. With an appropriate choice of \(\tau _1,\) they will offer higher power than the standard AH-based test and RMST-based test, respectively. Especially when censoring is light, the LT-AH-based test will have higher power than Cox’s hazard ratio test under many delayed difference patterns. On the other hand, the LT-RMST-based test can be more powerful or less powerful than Cox’s hazard ratio test, depending on the delayed difference pattern. LT-AH would be preferable to LT-RMST when two survival curves keep diverging after the separation time point (Pattern I; Fig. 2C). Conversely, LT-RMST would be preferable to LT-AH if two survival curves diverge after the separation time point and then converge as they head to the end of the follow-up (Pattern III; Fig. 2E).

4 Example

To illustrate the proposed method, we used data from the CheckMate 214 study, a recently conducted randomized clinical trial assessing immunotherapy in patients with previously untreated clear-cell advanced renal cell carcinoma. Figure 1A shows the Kaplan-Meier curves for PFS comparing nivolumab plus ipilimumab with sunitinib, where we reconstructed patient-level PFS data (Guyot et al. 2012) from the study results reported by Motzer et al. (2018). The estimated hazard ratio was 0.82 (0.95 CI: 0.68 to 0.99; p-value = 0.037), favoring the nivolumab plus ipilimumab arm. The estimated survival curve for the nivolumab plus ipilimumab arm was almost identical to that for the sunitinib arm up to approximately 7 months, with a slight separation emerging around 5 months (Fig. 1A). The observed survival curves suggested a delayed difference pattern and a possible deviation from the proportional hazards assumption. The cumulative residual test (Lin et al. 1993) produced a p-value of 0.084, which does not reach statistical significance at the 0.05 alpha level. However, a non-significant result in a test of the proportional hazards assumption should not be interpreted as confirmation that the assumption holds (Stensrud and Hernán 2020). The estimated survival curves also suggested that the nivolumab plus ipilimumab arm might confer a relatively long-term PFS benefit. However, the results of the hazard-ratio-based test/estimation approach do not address the long-term treatment effect of the immunotherapy. Therefore, we conducted a post hoc analysis using the proposed LT-AH test/estimation approach to capture the long-term benefit, focusing on a time window [7, 21] (months) as illustrated by the shaded areas in Fig. 1B. As a sensitivity analysis, we also applied the same approach using the time window [5, 21] (months). In the subsequent analysis, we proceed as if these time windows had been pre-selected independently of the observed data.
The difference in LT-AH on [7, 21] (months) was \(-\)0.023 (with LT-AH values of 0.028 for the nivolumab plus ipilimumab group and 0.051 for the sunitinib group) (Table 3). This can be interpreted as follows: on average, within the study time window between 7 and 21 months, the immunotherapy reduces the event rate by 2.3 events per 100 person-months compared to sunitinib (0.95 CI \(-\)0.037 to \(-\)0.008; p-value = 0.002). The ratio of LT-AH was 0.553 (0.95 CI 0.387 to 0.791; p-value = 0.001). Similar results were obtained using the [5, 21] (months) time window, with a difference in LT-AH of \(-\)0.023 (0.95 CI \(-\)0.038 to \(-\)0.009; p-value = 0.002) and a ratio of 0.606 (0.95 CI 0.451 to 0.815; p-value = 0.001) (Table 3). These findings illustrate that conducting the LT-AH analysis with different time windows can help confirm the robustness of the results in an exploratory context. Again, the validity of the aforementioned inference results requires pre-specification of the time window of interest.
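As a rough plausibility check on the numbers quoted above, the reported confidence intervals and p-values can be compared under the assumption of Wald-type inference (on the original scale for the difference and on the log scale for the ratio). This is only a sketch using the rounded published values; the helper `wald_p` is hypothetical and is not part of the survAH package.

```python
import math

Z = 1.959964  # 0.975 quantile of the standard normal distribution

def wald_p(estimate, lower, upper, log_scale=False):
    """Two-sided p-value implied by a Wald-type 0.95 CI (assumed form)."""
    if log_scale:
        estimate, lower, upper = (math.log(v) for v in (estimate, lower, upper))
    se = (upper - lower) / (2.0 * Z)       # back out the standard error
    z_stat = estimate / se
    return math.erfc(abs(z_stat) / math.sqrt(2.0))

# Time window [7, 21] months, values as reported in the text
diff_per_100 = 100 * (0.028 - 0.051)       # events per 100 person-months
p_diff = wald_p(-0.023, -0.037, -0.008)
p_ratio = wald_p(0.553, 0.387, 0.791, log_scale=True)
print(round(diff_per_100, 1), round(p_diff, 3), round(p_ratio, 3))
```

With these rounded inputs the implied p-values round to 0.002 and 0.001, in line with the reported values. The ratio of the rounded group estimates (0.028/0.051) differs slightly from the reported 0.553 because the inputs are rounded to three decimals.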
Table 3
Estimated restricted mean survival times and average hazards with time range \([\tau _{1},\tau _{2}]\) for treatment group (nivolumab plus ipilimumab) and control group (sunitinib) with the data reconstructed from the publication of the CheckMate 214 study

| Metric | Time range \([\tau _{1},\tau _{2}]\) (month) | Treatment (0.95 CI) | Control (0.95 CI) | \(\text{Difference}^*\) (0.95 CI; p-value) | \(\text{Ratio}^{**}\) (0.95 CI; p-value) |
|---|---|---|---|---|---|
| LT-AH | [5, 21] | 0.036 (0.029 to 0.045) | 0.059 (0.048 to 0.073) | \(-\)0.023 (\(-\)0.038 to \(-\)0.009; 0.002) | 0.606 (0.451 to 0.815; 0.001) |
| LT-AH | [7, 21] | 0.028 (0.022 to 0.037) | 0.051 (0.040 to 0.065) | \(-\)0.023 (\(-\)0.037 to \(-\)0.008; 0.002) | 0.553 (0.387 to 0.791; 0.001) |
| AH | [0, 21] | 0.049 (0.042 to 0.057) | 0.066 (0.057 to 0.076) | \(-\)0.017 (\(-\)0.029 to \(-\)0.005; 0.006) | 0.747 (0.608 to 0.917; 0.005) |
| LT-RMST | [5, 21] | 7.9 (7.1 to 8.7) | 6.7 (6.0 to 7.4) | 1.2 (0.1 to 2.3; 0.026) | 1.2 (1.0 to 1.4; 0.025) |
| LT-RMST | [7, 21] | 6.7 (5.9 to 7.3) | 5.5 (4.8 to 6.1) | 1.2 (0.2 to 2.1; 0.017) | 1.2 (1.0 to 1.4; 0.017) |
| RMST | [0, 21] | 12.2 (11.4 to 13.0) | 11.0 (10.2 to 11.8) | 1.2 (0.0 to 2.4; 0.043) | 1.1 (1.0 to 1.2; 0.041) |

LT-AH, long-term average hazard; AH, average hazard; LT-RMST, long-term restricted mean survival time; RMST, restricted mean survival time; CI, confidence interval. * Difference: Treatment − Control. A value below 0 is in favor of the treatment group for AH, and that above 0 is in favor of the treatment group for RMST. ** Ratio: Treatment/Control. A value below 1 is in favor of the treatment group for AH, and that above 1 is in favor of the treatment group for RMST
As a reference, we also applied the LT-RMST test/estimation approach. The difference in LT-RMST over [7, 21] (months) was 1.2 months (with LT-RMST values of 6.7 for the nivolumab plus ipilimumab group and 5.5 for the sunitinib group). That is, future patients receiving the immunotherapy would remain alive and progression-free for a mean of 6.7 months between months 7 and 21, which is 1.2 months longer than patients treated with sunitinib (0.95 CI 0.2 to 2.1; p-value = 0.017). The ratio of LT-RMST was 1.2 (0.95 CI 1.0 to 1.4; p-value = 0.017). When using the [5, 21] time window, the estimated difference and ratio remained the same, although the p-values were slightly higher (0.026 and 0.025 for the difference and the ratio, respectively) (Table 3).
The p-values of the LT-AH test for the difference were smaller than those of the LT-RMST test (0.002 vs. 0.017 for the [7, 21] time window and 0.002 vs. 0.026 for the [5, 21] time window). This is because the observed difference pattern was similar to pattern I (Fig. 2C) used in our numerical studies (i.e., a delayed difference pattern in which the two survival curves keep diverging after the separation time point).
Also, as a reference, we applied the standard AH-based and standard RMST-based methods with the time window [0, 21] (months) (Table 3). The p-values of the LT-AH and LT-RMST comparisons were lower than those of the standard AH-based and standard RMST-based tests, respectively. These results illustrate an advantage of the long-term versions of these metrics when the between-group difference in the early time window is regarded as noise.
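The point estimates of the kind reported in Table 3 are plug-in functionals of the Kaplan-Meier curve. The following is a minimal sketch and not the survAH implementation. It assumes that LT-AH on \([\tau_1, \tau_2]\) is \(\{F(\tau_2)-F(\tau_1)\}/\{R(\tau_2)-R(\tau_1)\}\), the ratio form that appears in Appendix B, and that LT-RMST is the area under the survival curve between \(\tau_1\) and \(\tau_2\). All function names are hypothetical, and no variance estimation is attempted.

```python
import numpy as np

def km_curve(time, event):
    """Kaplan-Meier curve: distinct event times and S(t) just after each one."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    event_times = np.unique(time[event == 1])
    at_risk = np.array([(time >= t).sum() for t in event_times])
    n_events = np.array([((time == t) & (event == 1)).sum() for t in event_times])
    return event_times, np.cumprod(1.0 - n_events / at_risk)

def surv_at(t, event_times, surv):
    """Value of the right-continuous step function S(t)."""
    idx = np.searchsorted(event_times, t, side="right")
    return 1.0 if idx == 0 else float(surv[idx - 1])

def rmst(tau, event_times, surv):
    """Area under the Kaplan-Meier step function on [0, tau]."""
    inside = event_times < tau
    knots = np.concatenate(([0.0], event_times[inside], [tau]))
    heights = np.concatenate(([1.0], surv[inside]))
    return float(np.sum(np.diff(knots) * heights))

def lt_summaries(time, event, tau1, tau2):
    """Plug-in LT-RMST and LT-AH on [tau1, tau2] for a single arm.

    tau2 should lie within the follow-up of the arm (assumption of this sketch).
    """
    et, s = km_curve(time, event)
    d_f = surv_at(tau1, et, s) - surv_at(tau2, et, s)  # F(tau2) - F(tau1)
    d_r = rmst(tau2, et, s) - rmst(tau1, et, s)        # R(tau2) - R(tau1)
    return {"LT-RMST": d_r, "LT-AH": d_f / d_r}
```

For a toy arm with uncensored event times 1, 2, 3, and 4 and the window [1.5, 3.5], the survival probability falls from 0.75 to 0.25 and the area under the curve over the window is 1.0, so the sketch returns an LT-AH of 0.5. Between-group differences and ratios follow by applying `lt_summaries` to each arm separately.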

5 Selection of time window

A practical issue with the proposed method is how to specify the time window \([\tau _1, \tau _2]\). Similar to LT-RMST-based approaches, we recommend that this window be selected based on clinical considerations (Horiguchi et al. 2018).
In the context of confirmatory clinical trials, the specific time window should be pre-specified in the study protocol, as the results of the test depend on this choice—especially in the presence of non-proportional hazards or crossing survival curves. When prior studies suggest a delayed treatment effect, \(\tau _1\) may be chosen as the expected separation point between survival curves. All such recommendations are intended for confirmatory settings, where pre-specification is essential for valid inference.
In contrast, the illustrative example presented in this paper, based on the CheckMate 214 trial, was intended as an exploratory post hoc analysis. In this example, based on visual inspection of the already published Kaplan-Meier curves for the two groups, separation between the curves appeared at around 6 months, with slight and more pronounced divergence emerging at approximately 5 and 7 months, respectively. Therefore, in this post hoc analysis, we selected \(\tau _1 = 5\) and 7 months and presented results for both. While such post hoc selection based on visual inspection or the use of multiple \(\tau _1\) values without adjustment for multiple comparisons is not appropriate for confirmatory settings, it can be informative for exploratory analyses and hypothesis generation.
When multiple candidate time points are considered for \(\tau _1\) in the confirmatory setting, appropriate methods are required to adjust for multiple comparisons. For example, an approach similar to that proposed by Horiguchi et al. (2023) for LT-RMST could be considered for LT-AH when evaluating multiple \(\tau _1\) values. However, further methodological development is needed to support such data-dependent selection of \(\tau _1\) under the confirmatory framework.
Regarding the choice of \(\tau _2\), Tian et al. (2020) demonstrated that, under mild conditions, the asymptotic properties of RMST remain valid when the largest observed time is used as the truncation point. However, they also noted that the condition required for valid inference on the t-year event probability at a fixed time point may not hold in typical clinical trial settings. Given that the AH combines RMST and the t-year event rate (see Eq. (1)), their results for RMST do not directly apply to AH or LT-AH. Further investigation is needed to evaluate the appropriateness of data-dependent choices of \(\tau _2\) for AH- and LT-AH-based analyses.
At present, ensuring that the size of the risk set at \(\tau _2\) does not become too small (e.g., fewer than 10 subjects) is critical for the stability of estimates and the validity of asymptotic inference. In practice, empirically selecting \(\tau _2\)—balancing the desire to capture long-term effects with the need to maintain sufficient follow-up—may be a reasonable strategy. Importantly, such an empirical selection strategy can also be incorporated into confirmatory analyses if the approach is clearly pre-specified in the study protocol. Finally, sensitivity analyses that examine results across multiple plausible combinations of \(\tau _1\) and \(\tau _2\) can provide valuable insight into the robustness of study conclusions.
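The risk-set check described above can be sketched as follows. This is an illustration only: the threshold of 10 subjects is the example value mentioned in the text, the function names are hypothetical, and the candidate grid for \(\tau_2\) is something the analyst would have to pre-specify.

```python
import numpy as np

def n_at_risk(time, tau):
    """Number of subjects still under observation at tau (observed time >= tau)."""
    return int(np.sum(np.asarray(time, dtype=float) >= tau))

def tau2_is_admissible(time_by_arm, tau2, min_at_risk=10):
    """True if every arm keeps at least min_at_risk subjects at risk at tau2."""
    return all(n_at_risk(t, tau2) >= min_at_risk for t in time_by_arm)

def largest_admissible_tau2(time_by_arm, candidates, min_at_risk=10):
    """Largest candidate tau2 that passes the risk-set check, or None."""
    ok = [tau for tau in sorted(candidates)
          if tau2_is_admissible(time_by_arm, tau, min_at_risk)]
    return ok[-1] if ok else None
```

For instance, with observed times 1 to 30 in one arm and 1 to 25 in the other, a candidate \(\tau_2\) of 21 leaves only 5 subjects at risk in the second arm, whereas 16 leaves at least 10 in both.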

6 Remarks

In this paper, extending the AH-based approach proposed by Uno and Horiguchi (2023), we proposed a new test/estimation approach using LT-AH that focuses on quantifying the long-term treatment benefit on a time-to-event outcome. The proposed LT-AH-based approach will be particularly useful in randomized clinical trials examining new therapies with potential delayed treatment effects, such as immunotherapy. Our numerical studies showed that LT-AH-based tests are more powerful than AH-based tests in detecting a delayed treatment effect. However, the LT-RMST-based tests demonstrated comparable performance. Which of these two long-term focused approaches is superior depends on the pattern of difference in the two underlying survival functions after the separation time point. Specifically, the strength of LT-AH lies in detecting a delayed difference in which the two survival curves keep diverging after the separation time point. On the other hand, LT-RMST is more powerful in detecting a delayed difference in which the difference between the two survival curves diminishes over time after the separation time point.
In addition to the power advantage in delayed difference scenarios, the proposed LT-AH approach can summarize the magnitude of the treatment effect in both absolute difference and relative terms using “hazard” (i.e., difference in LT-AH and ratio of LT-AH), meeting guideline recommendations and practical needs (Cobos-Carbo and Augustovski, 2011; Hopewell et al., 2025; Annals of Internal Medicine, 2024). Unlike Cox’s hazard ratio, the difference and ratio of LT-AH do not rely on model assumptions about the relationship between two groups, and these estimates are independent of the study-specific censoring time distribution. The proposed LT-AH approach provides a robust framework for estimating the magnitude of between-group differences and can be a useful alternative to the conventional log-rank/hazard-ratio test/estimation when anticipating delayed onset of treatment benefits on time-to-event outcomes.
Similarly to the piecewise log-rank test (Xu et al. 2017) introduced in Section 1, the LT-AH proposed in this paper is equivalent to a version of landmark analysis based on AH (see Appendix C). In an ideal application scenario, survival curves of the two treatment groups are identical up to the landmark time point, and the landmark analysis effectively compares the overall survival profile. When two survival curves differ before the landmark time point, we emphasize the need to exercise caution with the general limitations of the landmark analysis (Dafni 2011). Especially due to the potential violation of the intention-to-treat principle, a causal interpretation of the analysis results regarding the effect of the treatment on survival time would be almost impossible.
In this paper, we restricted delayed difference patterns to the cases where two survival functions were almost identical at the early study time. If two survival functions before the separation time point were not similar, the use of the long-term version of AH or RMST might not be appropriate. For this reason, we did not include so-called cross-survival cases, in which the Kaplan-Meier curves cross during follow-up, in our considerations or numerical studies. As discussed elsewhere (Horiguchi et al. 2023), for cross-survival scenarios, any single metric would be insufficient to inform which group is superior. Additional assessments using multiple metrics would be required.
The proposed method (LT-AH) is currently implemented in R. The latest version of the survAH R package (Uno et al. 2025) is available from a repository on the GitHub page of the corresponding author (https://uno1lab.com/survAH).

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Title: Assessing delayed treatment benefits of immunotherapy using long-term average hazard: a novel test/estimation approach
Authors: Miki Horiguchi, Lu Tian, Kenneth L. Kehl, Hajime Uno
Publication date: 14-10-2025
Publisher: Springer US
Published in: Lifetime Data Analysis / Issue 4/2025
Print ISSN: 1380-7870
Electronic ISSN: 1572-9249
DOI: https://doi.org/10.1007/s10985-025-09671-0

Appendix A: Large sample properties of \(Q_k\)

We start by noting well-known results about \({\hat{F}}_k(\cdot )\) and \({\hat{R}}_k(\cdot ).\) As shown by Fleming and Harrington (1991),
$$\begin{aligned} {n}_k^{1/2} \left\{ \frac{{\hat{F}}_k(\tau ) - {F}_k(\tau )}{1-F_k(\tau )} \right\} = n_k^{-1/2} \sum _{i=1}^{n_k} \int _0^{\tau } \frac{dM_{ki}(u)}{G_{k}(u)} +o_p(1), \end{aligned}$$
(A1)
and this converges weakly to a zero-mean normal distribution, where \(G_k(t) = \Pr (X_{k} \ge t),\) \(M_{ki}(t) = N_{ki}(t) - \int _0^t Y_{ki}(s)dH_{k}(s),\) \(N_{ki} (t) = I(X_{ki} \le t, \varDelta _{ki} =1), \) \(Y_{ki} (t) = I(X_{ki} \ge t),\) and \(H_k(t)\) is the cumulative hazard function of \(T_{k}\). Also, from the results shown by Zhao et al. (2012),
$$\begin{aligned} {n}_k^{1/2} \left\{ {\hat{R}}_k(\tau ) - {R}_k(\tau ) \right\} = - n_k^{-1/2} \sum _{i=1}^{n_k} \int _0^{\tau } \left\{ \int _{u}^{\tau } S_k(t)dt \right\} \frac{dM_{ki}(u)}{G_{k}(u)} + o_p(1), \end{aligned}$$
(A2)
which converges weakly to a zero-mean normal distribution.
Note that we assume that for \(\tau _{2}>\tau _{1}\ge 0,\) \(F_k(\tau _{2}) > F_k(\tau _{1}) \ge 0\) and \(R_k(\tau _{2})-R_k(\tau _{1}) > 0.\) Applying the Taylor series expansion to \(Q_k\) introduced in (4),
$$\begin{aligned} Q_{k} = {}& n_{k}^{1/2} \left( \left\{ F_{k}(\tau _{2}) - F_{k}(\tau _{1}) \right\} ^{-1} \left[ \left\{ {\hat{F}}_{k}(\tau _{2}) - {\hat{F}}_{k}(\tau _{1}) \right\} - \left\{ F_{k}(\tau _{2}) - F_{k}(\tau _{1}) \right\} \right] \right) \\ & \quad - n_{k}^{1/2} \left( \left\{ R_{k}(\tau _{2}) - R_{k}(\tau _{1}) \right\} ^{-1} \left[ \left\{ {\hat{R}}_{k}(\tau _{2}) - {\hat{R}}_{k}(\tau _{1}) \right\} - \left\{ R_{k}(\tau _{2}) - R_{k}(\tau _{1}) \right\} \right] \right) + o_{p}(1). \end{aligned}$$
(A3)
From the results of (A1),
$$\begin{aligned} \begin{aligned}&n_{k}^{1/2}\left[ {\hat{F}}_{k}(\tau _{2})-{\hat{F}}_{k}(\tau _{1})-\left\{ {F}_{k}(\tau _{2})-{F}_{k}(\tau _{1})\right\} \right] \\&\quad = n_k^{-1/2} \sum _{i=1}^{n_k} \int _0^{\tau _{2}} \left\{ {S}_k(\tau _{2}) - I(u\le \tau _{1}) {S}_k(\tau _{1}) \right\} \frac{dM_{ki}(u)}{G_{k}(u)} + o_p(1), \end{aligned} \end{aligned}$$
(A4)
which converges weakly to a zero-mean normal distribution. Also, using the results of (A2),
$$\begin{aligned} & n_{k}^{1/2} \left[ {\hat{R}}_{k}(\tau _{2}) - {\hat{R}}_{k}(\tau _{1}) - \left\{ R_{k}(\tau _{2}) - R_{k}(\tau _{1}) \right\} \right] \\ &\quad = - n_{k}^{-1/2} \sum _{i=1}^{n_{k}} \int _{0}^{\tau _{2}} \left\{ \int _{u}^{\tau _{2}} S_{k}(t)dt - I(u \le \tau _{1}) \int _{u}^{\tau _{1}} S_{k}(t)dt \right\} \frac{dM_{ki}(u)}{G_{k}(u)} + o_{p}(1), \end{aligned}$$
(A5)
which converges weakly to a zero-mean normal distribution. Incorporating the results of (A4) and (A5) into (A3),
$$\begin{aligned} Q_{k} = & n_{k}^{{ - 1/2}} \sum\limits_{{i = 1}}^{{n_{k} }} {\int_{0}^{{\tau _{2} }} {\left\{ {\frac{{S_{k} (\tau _{2} ) - I(u \le \tau _{1} )S_{k} (\tau _{1} )}}{{F_{k} (\tau _{2} ) - F_{k} (\tau _{1} )}}} \right.} } \\ & \quad + \left. {\frac{{\int_{u}^{{\tau _{2} }} {S_{k} } (t)dt - I(u \le \tau _{1} )\int_{u}^{{\tau _{1} }} {S_{k} } (t)dt}}{{R_{k} (\tau _{2} ) - R_{k} (\tau _{1} )}}} \right\}\frac{{dM_{{ki}} (u)}}{{G_{k} (u)}} + o_{p} (1). \\ \end{aligned} $$
Therefore, by the martingale central limit theorem, it is shown that \(Q_k\) converges weakly to a normal distribution with mean 0 and variance
$$\begin{aligned} \begin{aligned} V(Q_k)&=\int _0^{\tau _{2}} \left\{ \frac{{S}_k(\tau _{2}) - I(u \le \tau _{1}) {S}_k(\tau _{1})}{{F}_k(\tau _{2})-{F}_k(\tau _{1})} \right. \\& \quad\left. + \frac{\int _u^{\tau _{2}} {S}_k(t)dt - I(u\le \tau _{1}) \int _u^{\tau _{1}} {S}_k(t)dt}{{R}_k(\tau _{2})-{R}_k(\tau _{1})} \right\} ^{2} \frac{dH_{k}(u)}{G_{k}(u)}. \end{aligned} \end{aligned}$$
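As a sketch of the step from (A1) to (A4), assume the usual relation \(S_k(t) = 1 - F_k(t)\) between the survival and distribution functions (this relation is not restated in this appendix). Multiplying (A1) by \(S_k(\tau _j)\) for \(j = 1, 2\) and extending the integral to the common range \([0, \tau _2]\) with an indicator gives
$$\begin{aligned} n_k^{1/2} \left\{ {\hat{F}}_k(\tau _j) - F_k(\tau _j) \right\} = n_k^{-1/2} \sum _{i=1}^{n_k} \int _0^{\tau _2} I(u \le \tau _j)\, S_k(\tau _j) \frac{dM_{ki}(u)}{G_k(u)} + o_p(1), \quad j = 1, 2. \end{aligned}$$
Subtracting the \(j=1\) expression from the \(j=2\) expression, and noting that \(I(u \le \tau _2) = 1\) over the range of integration, yields the integrand \(S_k(\tau _2) - I(u \le \tau _1) S_k(\tau _1)\) in (A4). The same argument applied to (A2) gives (A5).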

Appendix B: Large sample properties of \(U_k\)

To show large sample properties of \(U_k\), we rewrite (8) as
$$\begin{aligned} \begin{aligned} U_k&= n_k^{1/2} \left\{ \frac{ {\hat{F}}_k(\tau _{2}) - {\hat{F}}_k(\tau _{1}) }{ {\hat{R}}_k(\tau _{2}) - {\hat{R}}_k(\tau _{1})} - \frac{ F_k(\tau _{2}) - F_k(\tau _{1}) }{ {\hat{R}}_k(\tau _{2}) - {\hat{R}}_k(\tau _{1}) }\right\} \\& \quad + n_k^{1/2} \left\{ \frac{ F_k(\tau _{2}) - F_k(\tau _{1}) }{ {\hat{R}}_k(\tau _{2}) - {\hat{R}}_k(\tau _{1}) } - \frac{ F_k(\tau _{2}) - F_k(\tau _{1}) }{R_k(\tau _{2}) - R_k(\tau _{1})}\right\} . \end{aligned} \end{aligned}$$
By a Taylor series expansion, \(U_k\) can be written as
$$\begin{aligned} U_{k} = {}& n_{k}^{1/2} \left[ {\hat{F}}_{k}(\tau _{2}) - {\hat{F}}_{k}(\tau _{1}) - \left\{ F_{k}(\tau _{2}) - F_{k}(\tau _{1}) \right\} \right] \left\{ R_{k}(\tau _{2}) - R_{k}(\tau _{1}) \right\} ^{-1} \\ & \quad - n_{k}^{1/2} \left\{ F_{k}(\tau _{2}) - F_{k}(\tau _{1}) \right\} \left\{ R_{k}(\tau _{2}) - R_{k}(\tau _{1}) \right\} ^{-2} \left[ {\hat{R}}_{k}(\tau _{2}) - {\hat{R}}_{k}(\tau _{1}) - \left\{ R_{k}(\tau _{2}) - R_{k}(\tau _{1}) \right\} \right] \\ & \quad + o_{p}(1). \end{aligned}$$
(A6)
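As a sketch of the step behind (A6), write \(a = F_k(\tau _{2}) - F_k(\tau _{1})\) and \(b = R_k(\tau _{2}) - R_k(\tau _{1})\), with \({\hat{a}}\) and \({\hat{b}}\) denoting the corresponding estimators. This shorthand is introduced here only for illustration and is not part of the original notation. The decomposition of \(U_k\) above, combined with a first-order expansion of \(1/{\hat{b}}\) around \(1/b\), reads
$$\begin{aligned} \frac{{\hat{a}}}{{\hat{b}}} - \frac{a}{b} = \frac{{\hat{a}} - a}{{\hat{b}}} + a\left( \frac{1}{{\hat{b}}} - \frac{1}{b}\right) , \qquad \frac{1}{{\hat{b}}} - \frac{1}{b} = -\frac{{\hat{b}} - b}{b^{2}} + O_p\left\{ ({\hat{b}} - b)^{2}\right\} . \end{aligned}$$
Assuming, as the representations (A4) and (A5) suggest, that \(n_k^{1/2}({\hat{a}} - a)\) and \(n_k^{1/2}({\hat{b}} - b)\) are bounded in probability and that \({\hat{b}}\) converges to \(b\) in probability, replacing \({\hat{b}}\) by \(b\) in the first term and dropping the quadratic remainder changes \(U_k\) only by \(o_p(1)\), which gives the two leading terms in (A6).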
Incorporating the results of (A4) and (A5) into (A6),
$$\begin{aligned} U_k = {}& n_k^{-1/2} \sum _{i=1}^{n_k} \int _0^{\tau _{2}} \left[ \frac{S_k(\tau _{2}) - I(u \le \tau _{1}) S_k(\tau _{1})}{R_k(\tau _{2}) - R_k(\tau _{1})} \right. \\ & \quad + \left. \frac{F_k(\tau _{2}) - F_k(\tau _{1})}{\left\{ R_k(\tau _{2}) - R_k(\tau _{1}) \right\} ^{2}} \left\{ \int _u^{\tau _{2}} S_k(t)dt - I(u \le \tau _{1}) \int _u^{\tau _{1}} S_k(t)dt \right\} \right] \frac{dM_{ki}(u)}{G_k(u)} \\ & \quad + o_p(1). \end{aligned}$$
Therefore, by the martingale central limit theorem, it is shown that \(U_k\) converges weakly to a normal distribution with mean 0 and variance
$$\begin{aligned} V(U_k) = {}& \int _0^{\tau _{2}} \left[ \frac{S_k(\tau _{2}) - I(u \le \tau _{1}) S_k(\tau _{1})}{R_k(\tau _{2}) - R_k(\tau _{1})} \right. \\ & \quad + \left. \frac{F_k(\tau _{2}) - F_k(\tau _{1})}{\left\{ R_k(\tau _{2}) - R_k(\tau _{1}) \right\} ^{2}} \left\{ \int _u^{\tau _{2}} S_k(t)dt - I(u \le \tau _{1}) \int _u^{\tau _{1}} S_k(t)dt \right\} \right] ^{2} \frac{dH_k(u)}{G_k(u)}, \end{aligned}$$
(A7)
where \(H_k(t)\) is the cumulative hazard function of \(T_{k}\). \(V(U_k)\) can be estimated by replacing the unknown quantities in (A7) with their empirical counterparts, as shown below.
$$\begin{aligned} {\hat{V}}(U_k) = {}& \int _0^{\tau _{2}} \left[ \frac{{\hat{S}}_k(\tau _{2}) - I(u \le \tau _{1}) {\hat{S}}_k(\tau _{1})}{{\hat{R}}_k(\tau _{2}) - {\hat{R}}_k(\tau _{1})} \right. \\ & \quad + \left. \frac{{\hat{F}}_k(\tau _{2}) - {\hat{F}}_k(\tau _{1})}{\left\{ {\hat{R}}_k(\tau _{2}) - {\hat{R}}_k(\tau _{1}) \right\} ^{2}} \left\{ \int _u^{\tau _{2}} {\hat{S}}_k(t)dt - I(u \le \tau _{1}) \int _u^{\tau _{1}} {\hat{S}}_k(t)dt \right\} \right] ^{2} \frac{d{\hat{H}}_k(u)}{{\hat{G}}_k(u)}, \end{aligned}$$
where \({\hat{G}}_k(t) = n_k^{-1} \sum _{i=1}^{n_k} I(X_{ki}\ge t),\) and \({\hat{H}}_k(\cdot )\) is the Nelson-Aalen estimator of \(H_k(\cdot )\) in group k.
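The plug-in estimator above can be sketched numerically for a single group. The following Python code is an illustration only, not the authors' implementation: the function names (`km_steps`, `surv_at`, `area`, `lt_ah_with_variance`) and the simulated data are hypothetical, and it assumes that \(F_k = 1 - S_k\) and \(R_k(t) = \int _0^t S_k(u)du\), with \(S_k\) estimated by the Kaplan-Meier estimator. The integral with respect to \(d{\hat{H}}_k(u)\) becomes a sum over distinct event times with Nelson-Aalen increments.

```python
import numpy as np

def km_steps(x, d):
    """Distinct event times with KM values, event counts and at-risk counts."""
    t, events = np.unique(x[d == 1], return_counts=True)
    at_risk = x.size - np.searchsorted(np.sort(x), t, side="left")  # #{X >= t_j}
    surv = np.cumprod(1.0 - events / at_risk)
    return t, surv, events, at_risk

def surv_at(s, t, surv):
    """Right-continuous KM step function at s (scalar or array); 1 before the first event."""
    idx = np.searchsorted(t, s, side="right") - 1
    return np.where(idx >= 0, surv[np.clip(idx, 0, None)], 1.0)

def area(a, b, t, surv):
    """Exact integral of the KM step function over [a, b]."""
    if b <= a:
        return 0.0
    knots = np.concatenate(([a], t[(t > a) & (t < b)], [b]))
    return float(np.sum(surv_at(knots[:-1], t, surv) * np.diff(knots)))

def lt_ah_with_variance(x, d, tau1, tau2):
    """Return the LT-AH estimate and the plug-in estimate of V(U_k)."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=int)
    n = x.size
    t, surv, events, at_risk = km_steps(x, d)
    s1 = float(surv_at(tau1, t, surv))
    s2 = float(surv_at(tau2, t, surv))
    d_f = s1 - s2                      # F(tau2) - F(tau1)
    d_r = area(tau1, tau2, t, surv)    # R(tau2) - R(tau1)
    var = 0.0
    for u, dj, yj in zip(t, events, at_risk):
        if u > tau2:
            break
        early = 1.0 if u <= tau1 else 0.0          # indicator I(u <= tau1)
        first = (s2 - early * s1) / d_r
        tail = area(u, tau2, t, surv) - early * area(u, tau1, t, surv)
        second = d_f / d_r**2 * tail
        # Nelson-Aalen increment dj/yj divided by G_hat(u) = yj/n
        var += (first + second) ** 2 * (dj / yj) / (yj / n)
    return d_f / d_r, var

# Simulated data for illustration only (not from any trial).
rng = np.random.default_rng(2025)
event_time = rng.exponential(12.0, size=400)
censor_time = rng.uniform(6.0, 36.0, size=400)
x = np.minimum(event_time, censor_time)
d = (event_time <= censor_time).astype(int)
eta_hat, v_hat = lt_ah_with_variance(x, d, tau1=6.0, tau2=24.0)
print(f"LT-AH = {eta_hat:.4f}, SE = {np.sqrt(v_hat / x.size):.4f}")
```

Because \(U_k\) is scaled by \(n_k^{1/2}\), the standard error of the LT-AH estimate in this sketch is taken as \(\{{\hat{V}}(U_k)/n_k\}^{1/2}\). Note also that for \(u \le \tau _{1}\) the two terms inside the square bracket cancel exactly, so only event times in \((\tau _{1}, \tau _{2}]\) contribute to the sum.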

Appendix C: Connection to the landmark analysis

Let \(S_T(t)\) denote the survival function of the event time T. In Section 2, we defined the LT-AH with a time window \((\tau _{1}, \tau _{2})\) as a function of \(S_T(t)\) as follows,
$$\begin{aligned} {\eta }_k(\tau _{1}, \tau _{2}) = \frac{S_{T}(\tau _{1})-S_{T}(\tau _{2})}{\int _{\tau _{1}}^{\tau _{2}}S_{T}(u)du}. \end{aligned}$$
Interestingly, \({\eta }_k(\tau _{1}, \tau _{2})\) can be interpreted as the standard AH over the time window \([0, \tau _2-\tau _1]\) for the residual lifetime \(T^*=T-\tau _1\) conditional on \(T\ge \tau _1.\) Specifically, the standard AH (Uno and Horiguchi 2023) of \(T^*\) with truncation time \(\tau _2 - \tau _1\) is
$$\begin{aligned} \frac{1-S_{T^*}(\tau _{2}-\tau _{1})}{\int _{0}^{\tau _{2}-\tau _{1}}S_{T^*}(u)du} = \frac{1-S_{T}(\tau _{2})/S_{T}(\tau _{1})}{\int _{0}^{\tau _{2}-\tau _{1}}S_{T}(u+\tau _{1})/S_{T}(\tau _{1})du} = \frac{S_{T}(\tau _{1})-S_{T}(\tau _{2})}{\int _{\tau _{1}}^{\tau _{2}}S_{T}(u)du}, \end{aligned}$$
because \(S_{T^*}(t)=S_{T}(t+\tau _{1})/S_{T}(\tau _{1}).\)
The nonparametric estimation and the corresponding inference procedures for \({\eta }_k(\tau _{1}, \tau _{2})\) introduced in Section 2 (3) were derived by replacing \(S_T(t)\) with its Kaplan-Meier estimator. This is equivalent to making inference on the standard AH of \(T^*\) using the methods presented by Uno and Horiguchi (2023), applied to the subgroup of patients with \(X_{ki}>\tau _1.\)
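The equivalence can be checked numerically for the point estimate. The sketch below is illustrative only: the helper names and the simulated data are hypothetical and it is not the survAH implementation. It computes the LT-AH from the full-sample Kaplan-Meier curve and, separately, the standard AH of the residual lifetime from the Kaplan-Meier curve of the subgroup with observed time greater than \(\tau _{1}\), after shifting the time origin to \(\tau _{1}\). It covers the estimate only, not the variance.

```python
import numpy as np

def km(x, d):
    """Kaplan-Meier estimate: distinct event times and survival just after each."""
    t, events = np.unique(x[d == 1], return_counts=True)
    at_risk = x.size - np.searchsorted(np.sort(x), t, side="left")  # #{X >= t_j}
    return t, np.cumprod(1.0 - events / at_risk)

def surv_at(s, t, surv):
    """Right-continuous KM step function at s; equals 1 before the first event."""
    idx = np.searchsorted(t, s, side="right") - 1
    if surv.size == 0:
        return np.ones_like(np.asarray(s, dtype=float))
    return np.where(idx >= 0, surv[np.clip(idx, 0, None)], 1.0)

def area(a, b, t, surv):
    """Exact integral of the KM step function over [a, b]."""
    knots = np.concatenate(([a], t[(t > a) & (t < b)], [b]))
    return float(np.sum(surv_at(knots[:-1], t, surv) * np.diff(knots)))

def lt_ah(x, d, tau1, tau2):
    """LT-AH over (tau1, tau2) from the full-sample KM curve."""
    t, surv = km(x, d)
    num = float(surv_at(tau1, t, surv)) - float(surv_at(tau2, t, surv))
    return num / area(tau1, tau2, t, surv)

def landmark_ah(x, d, tau1, tau2):
    """Standard AH of T* = T - tau1, using only subjects with X > tau1."""
    keep = x > tau1
    t, surv = km(x[keep] - tau1, d[keep])
    tau = tau2 - tau1
    return (1.0 - float(surv_at(tau, t, surv))) / area(0.0, tau, t, surv)

# Simulated right-censored data for illustration only.
rng = np.random.default_rng(1)
event_time = rng.exponential(10.0, size=300)
censor_time = rng.uniform(0.0, 30.0, size=300)
x = np.minimum(event_time, censor_time)
d = (event_time <= censor_time).astype(int)

full = lt_ah(x, d, tau1=5.0, tau2=20.0)
sub = landmark_ah(x, d, tau1=5.0, tau2=20.0)
print(full, sub, np.isclose(full, sub))
```

The two numbers agree up to floating-point rounding because, for \(t > \tau _{1}\), the Kaplan-Meier factors depend only on subjects still at risk after \(\tau _{1}\), so the subgroup curve equals the ratio of the full-sample curve at \(t\) to its value at \(\tau _{1}\).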

A’Hern R (2016) Restricted mean survival time: an obligatory end point for time-to-event analysis in cancer trials? J Clin Oncol 34(28):3474–3476
Alexander BM, Schoenfeld JD, Trippa L (2018) Hazards of hazard ratios - deviations from model assumptions in immunotherapy. N Engl J Med 378(12):1158–1159
Annals of Internal Medicine (https://www.acpjournals.org/journal/aim/authors/statistical-guidance) Information for authors: General statistical guidance (section 4: Measures of effect and risk). Accessed July 11, 2025
Bardo M, Huber C, Benda N, Brugger J, Fellinger T, Galaune V, Heinz J, Heinzl H, Hooker AC, Klinglmüller F, König F, Mathes T, Mittlböck M, Posch M, Ristl R, Friede T (2024) Methods for non-proportional hazards in clinical trials: a systematic review. Stat Methods Med Res 33(6):1069–1092
Chappell R, Zhu X (2016) Describing differences in survival curves. JAMA Oncol 2(7):906–907
Chen T (2013) Statistical issues and challenges in immuno-oncology. J Immunother Cancer 1:18
Cobos-Carbo A, Augustovski F (2011) CONSORT 2010 declaration: updated guideline for reporting parallel group randomised trials. Med Clin 137(5):213–215
Connolly SJ, Karthikeyan G, Ntsekhe M, Haileamlak A, El Sayed A, El Ghamrawy A, Damasceno A, Avezum A, Dans AML, Gitura B, Hu D, Kamanzi ER, Maklady F, Fana G, Gonzalez-Hermosillo JA, Musuku J, Kazmi K, Zühlke L, Gondwe L, Ma C, Paniagua M, Ogah OS, Molefe-Baikai OJ, Lwabi P, Chillo P, Sharma SK, Cabral TTJ, Tarhuni WM, Benz A, van Eikels M, Krol A, Pattath D, Balasubramanian K, Rangarajan S, Ramasundarahettige C, Mayosi B, Yusuf S (2022) Rivaroxaban in rheumatic heart disease-associated atrial fibrillation. N Engl J Med 387(11):978–988
Dafni U (2011) Landmark analysis at the 25-year landmark point. Circ Cardiovasc Qual Outcomes 4(3):363–371
Fleming T, Harrington D (1991) Counting Processes and Survival Analysis. John Wiley & Sons, New York
Gill R (1983) Large sample behaviour of the product-limit estimator on the whole line. Ann Stat 11:49–58
Guimarães HP, Lopes RD, Barros E, Silva PGM, Liporace IL, Sampaio RO, Tarasoutchi F, Hoffmann-Filho CR, de Lemos Soares Patriota R, Leiria TLL, Lamprea D, Precoma DB, Atik FA, Silveira FS, Farias FR, Barreto DO, Almeida AP, Zilli AC, de Souza Neto JD, Cavalcante MA, Figueira FAMS, Kojima FCS, Damiani L, Santos RHN, Valeis N, Campos VB, Saraiva JFK, Fonseca FH, Pinto IM, Magalhães CC, Ferreira JFM, Alexander JH, Pavanello R, Cavalcanti AB, Berwanger O, RIVER Trial Investigators (2020) Rivaroxaban in patients with atrial fibrillation and a bioprosthetic mitral valve. N Engl J Med 383(22):2117–2126
Guyot P, Ades A, Ouwens M, Welton N (2012) Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan–Meier survival curves. BMC Med Res Methodol 12(1):9
Hammad AY, Hodges JC, AlMasri S, Paniccia A, Lee KK, Bahary N, Singhi AD, Ellsworth SG, Aldakkak M, Evans DB, Tsai S, Zureikat A (2022) Evaluation of adjuvant chemotherapy survival outcomes among patients with surgically resected pancreatic carcinoma with node-negative disease after neoadjuvant therapy. JAMA Surg 158(1):55–62
Hopewell S, Chan AW, Collins GS, Hróbjartsson A, Moher D, Schulz KF, Tunn R, Aggarwal R, Berkwits M, Berlin JA, Bhandari N, Butcher NJ, Campbell MK, Chidebe RCW, Elbourne D, Farmer A, Fergusson DA, Golub RM, Goodman SN, Hoffmann TC, Ioannidis JPA, Kahan BC, Knowles RL, Lamb SE, Lewis S, Loder E, Offringa M, Ravaud P, Richards DP, Rockhold FW, Schriger DL, Siegfried NL, Staniszewska S, Taylor RS, Thabane L, Torgerson D, Vohra S, White IR, Boutron I (2025) CONSORT 2025 statement: updated guideline for reporting randomized trials. JAMA 333(22):1998–2005
Horiguchi M, Tian L, Uno H, Cheng S, Kim D, Schrag D, Wei LJ (2018) Quantification of long-term survival benefit in a comparative oncology clinical study. JAMA Oncol 4(6):881–882
Horiguchi M, Hassett M, Uno H (2019) How do the accrual pattern and follow-up duration affect the hazard ratio estimate when the proportional hazards assumption is violated? Oncologist 24(7):867–871
Horiguchi M, Tian L, Uno H (2023) On assessing survival benefit of immunotherapy using long-term restricted mean survival time. Stat Med 42(8):1139–1155
Huang B (2018) Some statistical considerations in the clinical development of cancer immunotherapies. Pharm Stat 17(1):49–60
Kalbfleisch J, Prentice R (1981) Estimation of the average hazard ratio. Biometrika 68(1):105–112
León LF, Lin R, Anderson KM (2020) On weighted log-rank combination tests and companion Cox model estimators. Stat Biosci 12(2):225–245
Lin DY, Wei LJ (1989) The robust inference for the Cox proportional hazards model. J Am Stat Assoc 84(408):1074–1078
Lin D, Wei L, Ying Z (1993) Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika 80(3):557–572
Motzer R, Tannir N, McDermott D, Arén Frontera O, Melichar B, Choueiri TK, Plimack ER, Barthélémy P, Porta C, George S, Powles T, Donskov F, Neiman V, Kollmannsberger CK, Salman P, Gurney H, Hawkins R, Ravaud A, Grimm MO, Bracarda S, Barrios CH, Tomita Y, Castellano D, Rini BI, Chen AC, Mekan S, McHenry MB, Wind-Rotolo M, Doan J, Sharma P, Hammers HJ, Escudier B, CheckMate 214 Investigators (2018) Nivolumab plus ipilimumab versus sunitinib in advanced renal-cell carcinoma. N Engl J Med 378(14):1277–1290
Paukner M, Chappell R (2021) Window mean survival time. Stat Med 40(25):5521–5533
Péron J, Roy P, Ozenne B, Roche L, Buyse M (2016) The net chance of a longer survival as a patient-oriented measure of treatment benefit in randomized clinical trials. JAMA Oncol 2(7):901–905
Rothman K, Greenland S, Lash T (2008) Modern epidemiology, 3rd edn. Lippincott Williams & Wilkins, Philadelphia
Roychoudhury S, Anderson KM, Ye J, Mukhopadhyay P (2021) Robust design and analysis of clinical trials with nonproportional hazards: a straw man guidance from a cross-pharma working group. Stat Biopharm Res 15(2):280–294
Royston P, Parmar M (2011) The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. Stat Med 30(19):2409–2421
Saad E, Zalcberg J, Péron J, Coart E, Burzykowski T, Buyse M (2018) Understanding and communicating measures of treatment effect on survival: Can we do better? J Natl Cancer Inst 110(3):232–240
Sanchis J, Bueno H, Miñana G, Guerrero C, Martí D, Martínez-Sellés M, Domínguez-Pérez L, Díez-Villanueva P, Barrabés JA, Marín F, Villa A, Sanmartín M, Llibre C, Sionís A, Carol A, García-Blas S, Calvo E, Gallardo MJM, Elízaga J, Gómez-Blázquez I, Alfonso F, del Blanco BG, Núñez J, Formiga F, Ariza-Solé A (2023) Effect of routine invasive vs conservative strategy in older adults with frailty and non-ST-segment elevation acute myocardial infarction. JAMA Intern Med 183(5):407–415
Stensrud MJ, Hernán MA (2020) Why test for proportional hazards? JAMA 323(14):1401–1402
Tarone RE, Ware J (1977) On distribution-free tests for equality of survival distributions. Biometrika 64(1):156–160
Tian L, Fu H, Ruberg SJ, Uno H, Wei LJ (2018) Efficiency of two sample tests via the restricted mean survival time for analyzing event time observations. Biometrics 74(2):694–702
Tian L, Jin H, Uno H, Lu Y, Huang B, Anderson K, Wei L (2020) On the empirical choice of the time window for restricted mean survival time. Biometrics 76(4):1157–1166
Uno H, Horiguchi M (2023) Ratio and difference of average hazard with survival weight: new measures to quantify survival benefit of new therapy. Stat Med 42(7):936–952
Uno H, Claggett B, Tian L, Inoue E, Gallo P, Miyata T, Schrag D, Takeuchi M, Uyama Y, Zhao L, Skali H, Solomon S, Jacobus S, Hughes M, Packer M, Wei LJ (2014) Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. J Clin Oncol 32(22):2380–2385
Uno H, Wittes J, Fu H, Solomon S, Claggett B, Tian L, Cai T, Pfeffer MA, Evans SR, Wei LJ (2015) Alternatives to hazard ratios for comparing the efficacy or safety of therapies in noninferiority studies. Ann Intern Med 163(2):127–134
Uno H, Horiguchi M, Hassett MJ (2020) Statistical test/estimation methods used in contemporary phase III cancer randomized controlled trials with time-to-event outcomes. Oncologist 25(2):91–93
Uno H, Horiguchi M, Qian Z (2025) uno1lab/survAH. https://doi.org/10.5281/zenodo.15667795, version 1.1.2
Xu Z, Zhen B, Park Y, Zhu B (2017) Designing therapeutic cancer vaccine trials with delayed treatment effect. Stat Med 36(4):592–605
Zhao Y, Zeng D, Socinski MA, Kosorok MR (2011) Reinforcement learning strategies for clinical trials in nonsmall cell lung cancer. Biometrics 67(4):1422–1433
Zhao L, Tian L, Uno H, Solomon SD, Pfeffer MA, Schindler JS, Wei LJ (2012) Utilizing the integrated difference of two survival functions to quantify the treatment contrast for designing, monitoring, and analyzing a comparative clinical study. Clin Trials 9(5):570–577
