
Estimating treatment effects on duration with disease: a principal stratification framework

  • Open Access
  • 01-03-2026
Published in:


Abstract

This article delves into the estimation of treatment effects on duration outcomes within specific patient subgroups, focusing on the impact of intensive follow-up on cancer recurrence detection and duration. The study introduces a principal stratification framework to identify and estimate treatment effects within the subgroup of patients who would experience a positive duration under one treatment. It demonstrates how this causal effect can be identified from observational data under a monotonicity assumption and introduces a sensitivity parameter to evaluate the impact of potential violations of this assumption. The article also illustrates how the causal effect can be estimated using multi-state models in conjunction with pseudo-observations to handle censoring. Through simulations and real-world data examples, the study shows that hypothesis testing based on the causal effect in the principal stratum offers greater statistical power than comparisons based on the overall treated and control groups. The findings highlight the importance of quantifying treatment effects within specific subgroups of patients to better understand the benefits of interventions.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Clinical studies often aim to estimate the average treatment effect. However, when the effect of a treatment varies among individuals, conditional estimates may become more relevant (Hauck et al. 1998; Harrell 2021). In some cases, an intervention is specifically designed to benefit a particular subgroup of patients. For example, the current paper was motivated by the planning of a clinical study investigating the effect of intensive follow-up compared to standard follow-up for women treated for vulvar cancer, the Danish Vulva Cancer Recurrence Study (DaVulvaRec), registered at https://clinicaltrials.gov under NCT06495554. The primary outcome was the diagnosis of cancer recurrence within two years of follow-up. The hypothesis was that intensive follow-up, which included regular blood tests and symptom questionnaires, would result in more women being diagnosed with recurrence and that the diagnosis would occur earlier. At the start of the study, all participants have the potential to benefit from the intervention, making the average treatment effect—such as the overall increase in recurrence diagnoses—an initially relevant measure.
However, the direct benefit of the intervention applies only to women who actually experience a recurrence. For these patients, it is also important to quantify the treatment effect within the subgroup eligible for direct benefit. A key outcome for this group is the duration of time living with a recurrence diagnosis, conceptualized as an extended treatment window.
Estimating average treatment effects within covariate-defined subgroups is common in the causal inference literature (Hernán and Robins 2020), and these are referred to as conditional average treatment effects. When subgroups are defined based on variables observed after the start of the clinical study, the framework of principal stratification applies (Frangakis and Rubin 2002). In this context, such subgroups are referred to as principal strata. The ICH E9 (R1) addendum proposed principal stratification as one of five strategies for dealing with intercurrent events after the start of the clinical study (ICH E9 (R1) 2020).
In this work, we define principal strata based on potential outcomes in one of the treatment groups. Specifically, we focus on the average treatment effect for a duration outcome in the principal stratum of individuals who would experience a positive duration in the treatment group with the highest event rate. We show that this causal effect can be identified from observational data under a monotonicity assumption and introduce a sensitivity parameter to evaluate the impact of potential violations of this assumption. In the DaVulvaRec study, the causal effect of interest pertains to patients diagnosed with recurrence under intensive follow-up. The monotonicity assumption posits that any patient diagnosed under standard follow-up would also be diagnosed under intensive follow-up. To address censoring, we adopt a multi-state modeling framework and employ censored multi-state models with pseudo-observations. The proposed methodology is illustrated through a sample size calculation and a simulated final analysis for the DaVulvaRec study, as well as an example using data from a randomized trial on colon cancer.

2 Method

Let T denote the time from an event of interest (e.g., diagnosis of disease) to either the end of follow-up at time \(\tau\) or death, whichever occurs first. If the event does not occur, we set \(T=0\). Define \(D=1(T>0)\) as an indicator of the event having occurred and let Z indicate treatment assignment, with \(Z=1\) for the treatment group and \(Z=0\) for the control group. We consider a random sample of observations from each group, denoted as \((D_1,T_1,Z_1),\ldots ,(D_n,T_n,Z_n)\). One parameter of interest relates to the number of individuals with the event of interest,
$$\begin{aligned} \alpha _1=\text {P}(D=1|Z=1)-\text {P}(D=1|Z=0). \end{aligned}$$
With complete follow-up, this can be estimated using empirical proportions:
$$\begin{aligned} \hat{\alpha }_1=\frac{1}{n_1} \sum _{i=1}^{n}D_{i}1(Z_i=1)-\frac{1}{n_0} \sum _{i=1}^{n}D_{i}1(Z_i=0), \end{aligned}$$
where \(n_j\), \(j=0,1\), represents the number of observations in each of the two groups.
In the context of the vulvar cancer study, D indicates a diagnosis of cancer recurrence, and T represents the duration from diagnosis until either two years (\(\tau =2\)) or death, whichever occurs first. The treatment group receives intensive follow-up, while the control group receives standard follow-up. The effect of treatment within the study timeframe includes several components: (1) patients who would have been diagnosed with recurrence under standard care within two years may experience an earlier diagnosis under intensive follow-up; (2) patients who would not have been diagnosed within two years under standard care may receive a recurrence diagnosis during the same period with intensive follow-up; and (3) patients who would have died without a recurrence diagnosis under standard care may be diagnosed with recurrence before death under intensive follow-up. Thus, the duration during which a patient lives with a recurrence diagnosis (T) captures all three components. Intensive follow-up is expected to improve outcomes for patients with recurrence. While it may not be possible to identify all patients with recurrence at two years, we can identify those diagnosed with recurrence in the intensive follow-up group and compare their outcomes to what would have occurred under standard follow-up. This type of comparison is possible in a counterfactual framework using potential outcomes. The principal stratum of patients with a diagnosis of recurrence under intensive follow-up is the largest group of patients that we can identify who would benefit from the intervention.
Define (D(z), T(z)) as the potential outcomes under treatment \(Z=z\). We assume consistency (\(T(z)=T\) when \(Z=z\)) and exchangeability, \((D(z),T(z))\perp \!\!\!\perp Z\) for \(z=0,1\), which typically holds in randomized trials. The causal estimand of interest is
$$\begin{aligned} \alpha _2=\text {E}(T(1)-T(0)|D(1)=1). \end{aligned}$$
The subgroup defined by \(D(1)=1\) represents a principal stratum in the sense of Frangakis and Rubin (2002). This causal effect is identifiable under a monotonicity assumption: any patient who would experience the event under control would also experience it under treatment. This is formalized in Proposition 1 with the proof given in the appendix.
Proposition 1
The causal estimand can be written as
$$\begin{aligned} \alpha _2 = \frac{1}{\text {P}(D=1|Z=1)}\left\{ \text {E}(T|Z=1) -\gamma \cdot \text {E}(T|Z=0) \right\} , \end{aligned}$$
(1)
where
$$\begin{aligned} \gamma&= \frac{\text {E}(T(0)1(D(1)=1))}{\text {E}(T(0))}. \end{aligned}$$
In general \(\gamma \le 1\). Under monotonicity, where \(D(0)=1\) always implies \(D(1)=1\), we have \(T(0)1(D(1)=1)=T(0)\) and hence \(\gamma =1\).
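The identity in Eq. (1) can be traced in a few lines. The following is only a sketch of the argument under the consistency and exchangeability assumptions stated above; the complete proof is the one given in the appendix.

$$\begin{aligned} \text {E}(T(1)-T(0)|D(1)=1)&= \frac{\text {E}(T(1)1(D(1)=1))-\text {E}(T(0)1(D(1)=1))}{\text {P}(D(1)=1)} \\&= \frac{\text {E}(T(1))-\gamma \cdot \text {E}(T(0))}{\text {P}(D(1)=1)} \\&= \frac{\text {E}(T|Z=1)-\gamma \cdot \text {E}(T|Z=0)}{\text {P}(D=1|Z=1)}. \end{aligned}$$

The second equality uses \(T(1)=0\) whenever \(D(1)=0\), together with the definition of \(\gamma\); the third replaces potential-outcome quantities by observed-data quantities through consistency and exchangeability.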
Under the monotonicity assumption, we obtain
$$\begin{aligned} \text {E}(T(1)-T(0)|D(1)=1) = \frac{1}{\text {P}(D=1|Z=1)}\left\{ \text {E}(T|Z=1) -\text {E}(T|Z=0) \right\} . \end{aligned}$$
(2)
Since \(\gamma \le 1\), the expression in Eq. (2) generally provides a lower bound for the target parameter whenever \(\text {E}(T|Z=1)\ge \text {E}(T|Z=0)\). In this representation, the term \(\text {E}(T|Z=1)-\text {E}(T|Z=0)\) corresponds to the average treatment effect on duration for the overall population, while the factor \(1/\text {P}(D=1|Z=1)\) rescales this effect to reflect the subpopulation of interest.
Although monotonicity cannot be tested directly, it entails that the event rate of positive duration is higher in the treatment group than in the control group:
$$\begin{aligned} \text {P}(D=1|Z=0)=\text {P}(D(0)=1)<\text {P}(D(1)=1)=\text {P}(D=1|Z=1), \end{aligned}$$
a condition that can be empirically evaluated. On this basis, the principal stratum is identified as the group with the higher event rate. In settings where \(Z=1\) denotes a new treatment and \(Z=0\) corresponds to a standard or no treatment, and where the reverse monotonicity assumption holds (i.e., \(D(1)=1\) always implies \(D(0)=1\)), expressions analogous to Eqs. (1) and (2) apply, with the roles of 0 and 1 reversed.
In general, the parameter \(\gamma\) is not identifiable from the observed data and is therefore introduced as a sensitivity parameter to assess the impact of deviations from monotonicity. One approach is to impose assumptions directly on \(\gamma\). Alternatively, \(\gamma\) can be expressed as
$$\begin{aligned} \gamma&=\frac{\text {E}(T(0)|D(1)=1)}{\text {E}(T(0)|D(0)=1)}\cdot \frac{\text {P}(D=1|Z=1)}{\text {P}(D=1|Z=0)}. \end{aligned}$$
(3)
The first ratio reflects the relative mean duration under control between individuals with positive duration in the treatment group versus those with positive duration in the control group; this component is not identifiable from the observed data. The second ratio, in contrast, captures the relative frequency of positive duration in the treatment group compared to the control group, and is identifiable. Thus, another strategy is to place assumptions specifically on the first ratio in Eq. (3).
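Equation (3) follows from the definition of \(\gamma\) in a short calculation, sketched here under the same consistency and exchangeability assumptions. Because \(T(0)=0\) whenever \(D(0)=0\), the denominator factorizes as \(\text {E}(T(0))=\text {E}(T(0)|D(0)=1)\text {P}(D(0)=1)\), and the numerator factorizes in the same way over \(D(1)\):

$$\begin{aligned} \gamma&=\frac{\text {E}(T(0)1(D(1)=1))}{\text {E}(T(0))} =\frac{\text {E}(T(0)|D(1)=1)\,\text {P}(D(1)=1)}{\text {E}(T(0)|D(0)=1)\,\text {P}(D(0)=1)} \\&=\frac{\text {E}(T(0)|D(1)=1)}{\text {E}(T(0)|D(0)=1)}\cdot \frac{\text {P}(D=1|Z=1)}{\text {P}(D=1|Z=0)}. \end{aligned}$$

Under monotonicity the two factors cancel exactly, which recovers \(\gamma =1\).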
In the absence of censoring, the estimate for \(\alpha _2\) under monotonicity becomes
$$\begin{aligned} \hat{\alpha }_2 = \frac{1}{\frac{1}{n_1}\sum _{i=1}^{n}D_{i}1(Z_i=1)}\left( \frac{1}{n_1}\sum _{i=1}^{n}T_{i}1(Z_i=1) -\frac{1}{n_0}\sum _{i=1}^{n}T_{i}1(Z_i=0)\right) . \end{aligned}$$
(4)
Estimation and inference can proceed via generalized estimating equations (GEE) for the outcomes \((D_{i},T_{i})\), treating \(Z_i\) as a covariate and using robust standard errors to account for correlation between \(D_{i}\) and \(T_{i}\) (Liang and Zeger 1986). The estimate and variance of \(\hat{\alpha }_2\) are subsequently obtained using the delta method.
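As a concrete illustration of the uncensored point estimates, the following minimal Python sketch computes \(\hat{\alpha }_1\) and \(\hat{\alpha }_2\) from Eq. (4) by plain group averages. The function name `estimate_effects` and the toy data are hypothetical and serve only to show the arithmetic; the sketch does not reproduce the GEE-based inference.

```python
def estimate_effects(durations, treatment):
    """Return (alpha1_hat, alpha2_hat) from complete, uncensored data.

    durations[i] is T_i (0 if the event never occurred);
    treatment[i] is Z_i (1 = treatment, 0 = control).
    """
    t1 = [t for t, z in zip(durations, treatment) if z == 1]
    t0 = [t for t, z in zip(durations, treatment) if z == 0]
    if not t1 or not t0:
        raise ValueError("both groups must be non-empty")

    p1 = sum(t > 0 for t in t1) / len(t1)  # empirical P(D=1 | Z=1)
    p0 = sum(t > 0 for t in t0) / len(t0)  # empirical P(D=1 | Z=0)
    if p1 == 0:
        raise ValueError("no events in the treatment group")

    mean1 = sum(t1) / len(t1)  # empirical E(T | Z=1)
    mean0 = sum(t0) / len(t0)  # empirical E(T | Z=0)

    alpha1_hat = p1 - p0
    alpha2_hat = (mean1 - mean0) / p1  # Eq. (4), assuming monotonicity
    return alpha1_hat, alpha2_hat


# Hypothetical toy data: durations in months, four patients per group.
durations = [0.0, 6.0, 0.0, 12.0, 0.0, 0.0, 0.0, 4.0]
treatment = [1, 1, 1, 1, 0, 0, 0, 0]
alpha1_hat, alpha2_hat = estimate_effects(durations, treatment)
```

In the toy data the overall mean difference is 3.5 months, and dividing by the treated event proportion of 0.5 rescales it to 7.0 months in the principal stratum.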
Define the group-specific outcome parameters:
$$\begin{aligned} \beta =(\text {P}(T_{i}>0|Z=0),\text {E}(T_{i}|Z=0),\text {P}(T_{i}>0|Z=1),\text {E}(T_{i}|Z=1)). \end{aligned}$$
Then \(\alpha _1=\beta _3-\beta _1\) and, assuming monotonicity, \(\alpha _2=\beta _{3}^{-1}(\beta _{4}-\beta _{2})\). It is shown in the appendix that the asymptotic variance of \(\hat{\alpha }_2\) is of the form
$$\begin{aligned} \sigma _0^2/n_0+\sigma _1^2/n_1, \end{aligned}$$
with
$$\begin{aligned} \sigma _0^2&= \frac{1}{\beta _{3}^2} \text {Var}(T|Z=0) \\ \sigma _1^2&=\frac{1}{\beta _{3}^2}\text {Var}(T|Z=1) -\frac{1}{\beta _{3}^3}(1-\beta _{3})(\beta _{4}^2-\beta _{2}^2) . \end{aligned}$$
We will revisit sample size calculation later; for now, it is sufficient to specify the values of \(\alpha _2\), \(\sigma _0\), and \(\sigma _1\), and the sample size calculation is similar to a two-sample mean comparison with unequal standard deviations.
Combining terms, we can express the variance as:
$$\begin{aligned} \left( \frac{1}{\beta _{3}^2} \text {Var}(T|Z=0)/n_0+\frac{1}{\beta _{3}^2}\text {Var}(T|Z=1)/n_1\right) -\frac{1}{\beta _{3}^3}(1-\beta _{3})(\beta _{4}^2-\beta _{2}^2)/n_1. \end{aligned}$$
(5)
The first term of Eq. (5) is also the asymptotic variance of the up-weighted estimate with known weights,
$$\begin{aligned} \hat{\alpha }_2^K&:=\frac{1}{\text {P}(D=1|Z=1)}\left( \frac{1}{n_1}\sum _{i=1}^{n}T_{i}1(Z_i=1) -\frac{1}{n_0}\sum _{i=1}^{n}T_{i}1(Z_i=0)\right) \\&=: \frac{1}{\text {P}(D=1|Z=1)}\hat{\alpha }_2^*. \end{aligned}$$
When \(\text {E}(T|Z=1)> \text {E}(T|Z=0)\), the second term in Eq. (5) becomes negative, implying that the variance of \(\hat{\alpha }_2\) is strictly smaller than that of \(\hat{\alpha }_2^K\). Consequently, hypothesis tests based on \(\hat{\alpha }_2\) achieve greater statistical power. The inequality \(\text {E}(T|Z=1)> \text {E}(T|Z=0)\) is guaranteed under the stronger monotonicity condition \(T(1)> T(0)\).
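The comparison between the two variances can be made concrete with a small numerical sketch. The function below evaluates \(\sigma _0^2\) and \(\sigma _1^2\) from the formulas above, together with the group-1 term of the known-weights estimate \(\hat{\alpha }_2^K\). All input values are hypothetical and chosen only for illustration; they are not taken from the study.

```python
def variance_components(beta2, beta3, beta4, var_t0, var_t1):
    """Return (sigma0_sq, sigma1_sq, sigma1_sq_known_weights).

    beta2 = E(T|Z=0), beta3 = P(T>0|Z=1), beta4 = E(T|Z=1);
    var_t0 and var_t1 are Var(T|Z=0) and Var(T|Z=1).
    """
    sigma0_sq = var_t0 / beta3**2
    sigma1_sq_known = var_t1 / beta3**2  # group-1 term for known weights
    # Second term of Eq. (5): the reduction from estimating P(D=1|Z=1).
    reduction = (1 - beta3) * (beta4**2 - beta2**2) / beta3**3
    return sigma0_sq, sigma1_sq_known - reduction, sigma1_sq_known


# Hypothetical inputs (durations in months).
sigma0_sq, sigma1_sq, sigma1_sq_known = variance_components(
    beta2=2.0, beta3=0.4, beta4=4.0, var_t0=20.0, var_t1=30.0
)

n0 = n1 = 100
var_alpha2 = sigma0_sq / n0 + sigma1_sq / n1              # proposed estimate
var_alpha2_known = sigma0_sq / n0 + sigma1_sq_known / n1  # known weights
```

With these inputs the variance of the proposed estimate is 2.0 compared with 3.125 for the known-weights version, in line with the sign argument above since \(\beta _4>\beta _2\).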

2.1 The extended illness-death model

It is often useful to model the distribution of T within a multi-state framework, which naturally accommodates censoring. For the vulvar cancer data, we adopt an extended illness-death model, as illustrated in Fig. 1. In this model, the quantity T represents the duration of time spent in state 2: Alive with a diagnosis of disease.
Let X(t) denote the state occupied by a participant at time t and define the transition probabilities as
$$\begin{aligned} \text {P}_{hj}(s,t)=\text {P}(X(t)=j|X(s)=h, {{\mathcal {F}}}_{s-}), \end{aligned}$$
where \(h,j\in {{\mathcal {S}}}\), the set of all possible states, and \({{\mathcal {F}}}_{s-}=\sigma (X(u), u<s)\) is the past history of the multi-state process X. The state occupation probabilities are given by
$$\begin{aligned} Q_h(t)=\text {P}(X(t)=h)=\sum _j Q_j(0)\text {P}_{jh}(0,t), \; h\in {{\mathcal {S}}}, \end{aligned}$$
where Q(0) is the distribution of the initial state. In the vulvar cancer application, the state space is \({{\mathcal {S}}}=\{ 1,2,3,4\}\), where 1: Alive without a diagnosis of the disease; 2: Alive with a diagnosis of the disease; 3: Death without a diagnosis of the disease; 4: Death with a diagnosis of the disease. Here all women are in the same state (1) at time 0, i.e. \(Q_1(0)=1\). Transition intensities are defined as
$$\begin{aligned} \lambda _{hj}(t|{{\mathcal {F}}}_{t-})=\lim _{\varDelta t\rightarrow 0}\frac{\text {P}_{hj}(t,t+\varDelta t|{{\mathcal {F}}}_{t-})}{\varDelta t}, \; h,j\in {{\mathcal {S}}}, \; h\ne j. \end{aligned}$$
The process is Markovian if the intensities depend only on the current state h at time t, and any time-fixed covariates, but not the full past history. In the extended illness-death model, the Markov property holds provided that the death rate (intensity for the transition \(2\rightarrow 4\)) does not depend on the time of diagnosis of disease (entry into state 2). In practice, however, this assumption is often implausible. Importantly, the proposed approach does not rely on the Markov property.
The \(\beta\) parameter vector can now be written in terms of the multi-state process as
$$\begin{aligned} \beta&=(Q_{2}(\tau |Z=0)+Q_{4}(\tau |Z=0),\int _0^\tau Q_{2}(t|Z=0)dt, \\&\hspace{0.6cm} Q_{2}(\tau |Z=1)+Q_{4}(\tau |Z=1),\int _0^\tau Q_{2}(t|Z=1)dt). \nonumber \end{aligned}$$
(6)
Estimation of \(\beta\) in this multi-state setting is carried out using pseudo-observations based on the Aalen–Johansen estimator.
Fig. 1 The extended illness-death model
We assume that we observe independent, possibly right-censored realizations of the multi-state process \(X(\cdot )\). The observed data can be represented using counting processes \(N_{hji}(\cdot )\) for each individual i, where \(N_{hji}(t)\) counts the number of direct \(h\rightarrow j\) transitions up to time t. The observation window is restricted to \(t\le \tau _i\), the minimum of the time to absorption or censoring.
A non-parametric estimator for \(\text {P}(0,t)\), assuming a homogeneous group, is obtained by plug-in of the Nelson-Aalen estimator
$$\begin{aligned} \widehat{\varLambda }_{hj}(t)=\int _0^t \frac{\sum _i \text {d}N_{hji}(u)}{\sum _i Y_{hi}(u)}, \end{aligned}$$
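where \(Y_{hi}(u)=1(X_{i}(u-)=h)\) indicates whether individual i is in state h just before time u. Let \(\widehat{\varLambda }\) denote the matrix with off-diagonal entries \(\widehat{\varLambda }_{hj}\) and diagonal entries \(-\sum _{j\ne h}\widehat{\varLambda }_{hj}\). The resulting estimator of the transition probability matrix is
$$\begin{aligned} \widehat{\text {P}}(0,t)=\prod _{u\in (0,t]}\left( I+\text {d}\widehat{\varLambda }(u)\right) , \end{aligned}$$
known as the Aalen–Johansen estimator (Aalen and Johansen 1978), where the product over \(u\in (0,t]\) of the increments \(\text {d}\widehat{\varLambda }(u)\) denotes product integration. The estimated state occupation probabilities are given by the row vectors
$$\begin{aligned} {\widehat{Q}}(t)={\widehat{Q}}(0)\widehat{\text {P}}(0,t), \end{aligned}$$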
where \({\widehat{Q}}(0)\) is the empirical distribution of the initial state. Since \(\widehat{\varLambda }\) has jumps only at observed transition times, the product integral reduces to the ordinary matrix product
$$\begin{aligned} \prod _{u\in (0,t]}(I+\varDelta \widehat{\varLambda }(u)), \end{aligned}$$
where u corresponds to transition times. The expected time spent in state j up to time t, often referred to as the expected length of stay, is estimated as
$$\begin{aligned} \int _0^t {\hat{Q}}_j(u)\text {d}u. \end{aligned}$$
In the absence of censoring, the Aalen-Johansen-based estimate of the multi-state parameter \(\alpha _2\) reduces to the corresponding uncensored estimate in Eq. (4).
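The reduction to an ordinary matrix product makes the estimator easy to sketch in code. The following pure-Python illustration propagates the row vector of state occupation probabilities through each observed transition time and integrates the resulting step function to obtain the expected length of stay. The data layout (a list of transitions plus an end-of-observation time per subject), the 0-based state indices, and the function name `aalen_johansen` are assumptions made for this sketch; it is not the implementation used in the paper and gives no variance estimate.

```python
def aalen_johansen(subjects, tau, n_states=4):
    """Return (Q(tau), expected length of stay per state on [0, tau]).

    Each subject is (transitions, end): transitions is a list of
    (time, from_state, to_state) with 0-based states, and end is the time
    of absorption or censoring. Everyone is assumed to start in state 0.
    """
    def state_before(transitions, u):
        state = 0
        for time, _, to_state in sorted(transitions):
            if time < u:
                state = to_state
        return state

    jump_times = sorted({time for trans, _ in subjects
                         for time, _, _ in trans if time <= tau})
    q = [1.0] + [0.0] * (n_states - 1)  # Q(0): all mass in state 0
    stay = [0.0] * n_states
    previous = 0.0
    for u in jump_times:
        # Q is constant between jumps, so the integral is a sum of rectangles.
        stay = [s + qh * (u - previous) for s, qh in zip(stay, q)]
        at_risk = [0] * n_states
        counts = [[0] * n_states for _ in range(n_states)]
        for trans, end in subjects:
            if u <= end:  # still under observation just before u
                at_risk[state_before(trans, u)] += 1
            for time, h, j in trans:
                if time == u:
                    counts[h][j] += 1
        # Row-vector update Q(u) = Q(u-) (I + dLambda(u)).
        new_q = q[:]
        for h in range(n_states):
            for j in range(n_states):
                if h != j and counts[h][j] > 0:
                    flow = q[h] * counts[h][j] / at_risk[h]
                    new_q[h] -= flow
                    new_q[j] += flow
        q = new_q
        previous = u
    stay = [s + qh * (tau - previous) for s, qh in zip(stay, q)]
    return q, stay


# Hypothetical uncensored toy data; index 1 is "alive with diagnosis".
subjects = [
    ([(0.5, 0, 1)], 2.0),               # diagnosed at 0.5, alive at tau
    ([], 2.0),                          # no event before tau
    ([(1.0, 0, 2)], 1.0),               # died without diagnosis at 1.0
    ([(1.0, 0, 1), (1.5, 1, 3)], 1.5),  # diagnosed at 1.0, died at 1.5
]
q_tau, stay = aalen_johansen(subjects, tau=2.0)
```

In this uncensored toy example the estimated length of stay in the diagnosis state equals the empirical mean duration, (1.5 + 0.5)/4 = 0.5, and the probability of ever being diagnosed (index 1 plus index 3) equals the empirical proportion 2/4, which is the reduction to Eq. (4) noted above.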
Variance estimation for the Aalen–Johansen estimator of \(\beta\), without assuming the Markov property, can be derived from its influence function (Glidden 2002). However, a more practical approach involves the use of pseudo-observations (Andersen et al. 2003), which serve as indirect estimates of the influence functions (Parner et al. 2023).
The pseudo-observation method is a flexible way to perform regression analysis for censored event data. The method relies on a well-defined estimator for the quantity of interest, \(\theta =\textrm{E}(V)\). Let \(\hat{\theta }\) represent the estimate of \(\theta\) based on the full sample \(X_1,\ldots , X_n\) and let \(\hat{\theta }^{(i)}\) denote the corresponding estimate obtained by leaving out observation \(X_i\), i.e., from the sample \(X_1,\ldots ,X_{i-1},\) \(X_{i+1},\ldots ,X_{n}\). Here \(X_i\) denotes the time-to-event data on subject i. The jack-knife pseudo-observation for the i-th observation is defined as
$$\begin{aligned} \hat{\theta }_{i} = n\hat{\theta }-(n-1)\hat{\theta }^{(i)}. \end{aligned}$$
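The jack-knife definition translates directly into a leave-one-out loop. The sketch below is a generic illustration in which the estimator is passed in as a function; the helper names and the toy sample are hypothetical. It uses the sample mean, for which the pseudo-observation is exactly the original observation. With censored data one would instead plug in an estimator that handles censoring, such as one based on the Aalen–Johansen estimator.

```python
def jackknife_pseudo_observations(sample, estimator):
    """Return the pseudo-observations n*theta_hat - (n-1)*theta_hat^(i)."""
    n = len(sample)
    theta_full = estimator(sample)
    pseudo = []
    for i in range(n):
        leave_one_out = sample[:i] + sample[i + 1:]
        pseudo.append(n * theta_full - (n - 1) * estimator(leave_one_out))
    return pseudo


def sample_mean(values):
    return sum(values) / len(values)


# Hypothetical uncensored durations.
sample = [0.0, 3.0, 0.0, 5.0, 1.0]
pseudo = jackknife_pseudo_observations(sample, sample_mean)
```

For the mean, each pseudo-observation reproduces the corresponding data point, which illustrates why pseudo-observations can stand in for the unobserved \(V_i\) in a regression model.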
Assume a regression model \(\textrm{E}(V_i|Z_i)=\mu (\beta _0;Z_i)\), where \(\mu (\beta ; Z_i) = \mu (\beta ^T Z_i)\) is typically the inverse of the link function in a generalized linear model. Let \(A(\beta ;Z_i)\) be a vector function depending only on the regression parameters and covariates. Estimates of \(\beta _0\) are then obtained based on \(\hat{\theta }_{1},\ldots , \hat{\theta }_{n}\) by solving an estimating equation
$$\begin{aligned} \sum _{i=1}^n A(\beta ; Z_i)\{\hat{\theta }_{i}-\mu (\beta ; Z_i )\} = 0. \end{aligned}$$
(7)
This equation corresponds to a generalized linear model where the pseudo-observation \(\hat{\theta }_{i}\) replaces the potentially unobserved \(V_i\). Pseudo-observations based on the Aalen–Johansen estimator have been shown to provide unbiased estimates when the censoring time is independent of both the multi-state process and the covariates (Overgaard et al. 2023). An alternative approach is given by the infinitesimal pseudo-observations, defined as \(\phi (F_n)+\dot{\phi }_{F_n}(X_i)\), where \(\phi (F_n)\) denotes the pooled estimator and \(\dot{\phi }_{F_n}(\cdot )\) the estimated influence function. These infinitesimal pseudo-observations share the same properties as the jackknife pseudo-observations (Parner et al. 2023). The infinitesimal pseudo-observations are the version implemented in the survival package in R. The variance of the infinitesimal pseudo-observations is an estimate of the variance in Glidden (2002). Pseudo-observations may be computed jointly across comparison groups or separately within each group. In the simulations and data applications, the pseudo-observations were computed separately within each group.
Let \(\hat{\theta }_{ij}\) denote the bivariate pseudo-observation for individual i within group j for
$$\begin{aligned} (Q_{2}(\tau |Z=j)+Q_{4}(\tau |Z=j),\int _0^\tau Q_{2}(t|Z=j)dt). \end{aligned}$$
We use the covariance of the pseudo-observations to estimate the covariance of \((\hat{\beta }_1,\hat{\beta }_2)\) in group 0 and of \((\hat{\beta }_3,\hat{\beta }_4)\) in group 1. The asymptotic variance of \(\hat{\alpha }_2\) again follows from the delta method and has the form \(\sigma _0^2/n_0+\sigma _1^2/n_1\), where \(\sigma _0, \sigma _1\) are functions of the covariance of \(\hat{\theta }_{ij}\), \(\varSigma _j\) say. Further details are provided in the appendix.

3 Simulations

We investigate three simulation scenarios to assess the performance of the proposed estimator: (1) its small-sample properties; (2) its power in comparison to the approach of comparing the average time with the disease among all individuals; and (3) its efficiency.
Scenario 1. We consider an extended illness-death model where all transitions follow a time-homogeneous Markov process, that is, with constant hazard rates that do not depend on event history. For the control group (standard follow-up), we use the following rates:
  • The recurrence rate is \(\lambda _{12}(t|Z=0)=\lambda _{12}=0.116\) per year,
  • The mortality rate without recurrence is \(\lambda _{13}(t|Z=0)=\lambda _{13}=0.027\) per year,
  • The mortality rate with recurrence is \(\lambda _{24}(t|Z=0)=5\cdot \lambda _{13}(t|Z=0)\).
These rates correspond to a 2-year incidence of recurrence of 20% and a 2-year incidence of recurrence or death of 25%. These values were based on prior findings for vulvar cancer (Zach et al. 2021). Under this scenario, \((\beta _1,\beta _2)=(20\%,2.3 \text { months})\).
For the treated group (intensive follow-up), we assume accelerated time-to-disease with factor \(a=50\%\). Thus, the disease rate becomes: \(\lambda _{12}(t|Z=1)=a^{-1}\cdot \lambda _{12}(a^{-1}t|Z=0)\). The other rates are assumed unchanged, i.e., \(\lambda _{13}(t|Z=1)=\lambda _{13}(t|Z=0)\) and \(\lambda _{24}(t|Z=1)=\lambda _{24}(t|Z=0)\). This yields \((\beta _3,\beta _4)=(36\%,4.2 \text { months})\), indicating an average increase of 1.9 months in time with disease under treatment.
The causal parameter \(\alpha _2\) represents the average increase in time with recurrence under intensive follow-up among those who would have been diagnosed with recurrence under intensive follow-up. Assuming the monotonicity condition (\(D(0)=1 \Rightarrow D(1)=1\), i.e., any recurrence detected under standard follow-up would also be detected under intensive follow-up), the difference in time with recurrence increases to 5.5 months for this subgroup.
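Because all hazards in Scenario 1 are constant, the \(\beta\) parameters have closed forms, and the scenario values can be checked with a few lines of Python. The sketch below is an illustration under the stated rates: it uses the cumulative incidence of diagnosis and the integral of \(Q_2(t)=\frac{\lambda _{12}}{\lambda _{24}-\lambda _{12}-\lambda _{13}}\left( e^{-(\lambda _{12}+\lambda _{13})t}-e^{-\lambda _{24}t}\right)\), which holds for a time-homogeneous Markov model with \(\lambda _{24}\ne \lambda _{12}+\lambda _{13}\). The function name is hypothetical, and the acceleration factor \(a=0.5\) is applied to the constant rate as \(\lambda _{12}/a\).

```python
import math


def occupation_summaries(lam12, lam13, lam24, tau):
    """Return (P(diagnosed by tau), expected years in state 2 on [0, tau]).

    Assumes constant hazards and lam24 != lam12 + lam13.
    """
    exit1 = lam12 + lam13  # total rate of leaving state 1
    p_diagnosed = lam12 / exit1 * (1 - math.exp(-exit1 * tau))
    # Integrate Q_2(t) term by term over [0, tau].
    integral_exit1 = (1 - math.exp(-exit1 * tau)) / exit1
    integral_lam24 = (1 - math.exp(-lam24 * tau)) / lam24
    stay2 = lam12 / (lam24 - exit1) * (integral_exit1 - integral_lam24)
    return p_diagnosed, stay2


lam12, lam13, tau, a = 0.116, 0.027, 2.0, 0.5
lam24 = 5 * lam13

beta1, beta2 = occupation_summaries(lam12, lam13, lam24, tau)      # control
beta3, beta4 = occupation_summaries(lam12 / a, lam13, lam24, tau)  # treated

alpha2_months = 12 * (beta4 - beta2) / beta3  # Eq. (2), years to months
```

With these inputs the sketch gives event proportions of about 20% and 36% and a principal-stratum effect of about 5.5 months; the mean durations land within roughly a tenth of a month of the values quoted above.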
Simulations are conducted with equal group sizes \(n_0=n_1=50, 100, 200, 500\) and censoring rates \(\lambda _C=0,0.1\) per year. Table 1 reports the observed proportion of censored event data (\(p_C\)), the average estimate (Ave \(\hat{\alpha }_2\)) and its bias, \(\sqrt{n}\) times the standard deviation of \(\hat{\alpha }_2\) across 10,000 replications (\(\text {SD}_{\alpha _2}\)), \(\sqrt{n}\) times the average standard error from the Huber–White robust variance estimator (\(\hbox {Se}_{\text {HW}}\)), and the coverage of the 95% confidence interval based on the robust variance (Coverage), with estimates expressed in months. The bias is small for \(n_0=n_1\ge 100\), where the normal approximation and the variance estimate are both accurate; results remain acceptable for \(n_0=n_1= 50\).
Table 1
Small sample properties of the proposed estimate
| \(n_0=n_1\) | \(\lambda _{C}\) | \(p_C\) | Ave \(\hat{\alpha }_2\) | Bias | \(\sqrt{n}\text {SD}_{\alpha _2}\) | \(\sqrt{n}\text {Se}_{\text {HW}}\) | Coverage |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.000 | 0.000 | 5.25 | −0.24 | 22.00 | 22.32 | 0.940 |
| 100 | 0.000 | 0.000 | 5.39 | −0.10 | 21.64 | 21.59 | 0.946 |
| 200 | 0.000 | 0.000 | 5.42 | −0.07 | 21.10 | 21.27 | 0.951 |
| 500 | 0.000 | 0.000 | 5.46 | −0.03 | 20.86 | 21.12 | 0.953 |
| 50 | 0.100 | 0.151 | 5.21 | −0.28 | 23.01 | 23.27 | 0.945 |
| 100 | 0.100 | 0.150 | 5.35 | −0.14 | 22.36 | 22.31 | 0.946 |
| 200 | 0.100 | 0.151 | 5.43 | −0.06 | 21.88 | 21.87 | 0.949 |
| 500 | 0.100 | 0.150 | 5.45 | −0.03 | 21.62 | 21.73 | 0.953 |
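To give a feel for how such a simulation can be set up, the following sketch draws uncensored durations from the Scenario 1 model using exponential waiting times and computes \(\hat{\alpha }_2\) in months from Eq. (4). It is a simplified stand-in for the study described above: it omits censoring, pseudo-observations, and the robust variance estimate, and the function names, group size, and seed are arbitrary choices for illustration.

```python
import random


def simulate_duration(lam12, lam13, lam24, tau, rng):
    """Years lived with a recurrence diagnosis on [0, tau] for one patient."""
    t_recurrence = rng.expovariate(lam12)
    t_death_without = rng.expovariate(lam13)
    if t_recurrence >= tau or t_death_without < t_recurrence:
        return 0.0  # never diagnosed within the study window
    t_death_with = t_recurrence + rng.expovariate(lam24)
    return min(tau, t_death_with) - t_recurrence


def simulate_alpha2(n_per_group, seed):
    """Return the uncensored estimate of alpha_2 in months."""
    rng = random.Random(seed)
    lam13, tau = 0.027, 2.0
    lam24 = 5 * lam13
    control = [simulate_duration(0.116, lam13, lam24, tau, rng)
               for _ in range(n_per_group)]
    treated = [simulate_duration(0.232, lam13, lam24, tau, rng)
               for _ in range(n_per_group)]
    p1 = sum(t > 0 for t in treated) / n_per_group
    mean_difference = sum(treated) / n_per_group - sum(control) / n_per_group
    return 12 * mean_difference / p1


alpha2_months = simulate_alpha2(n_per_group=20000, seed=2026)
```

With a large group size the estimate should fall near the target value of about 5.5 months; the exact figure depends on the seed.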
Scenario 2. This scenario compares the power of tests based on the proposed estimator \(\hat{\alpha }_2\) with an alternative estimator \(\hat{\alpha }_2^*\), which averages the time with recurrence across all individuals, regardless of whether recurrence occurred. Using the same transition model and parameters as in Scenario 1, we vary the recurrence rate, \(\lambda _{12}=0.116, 0.232\) per year, and the censoring rate, \(\lambda _C=0,0.1,0.2\) per year. Table 2 summarizes the recurrence probability \(p_D\), the censoring proportion \(p_C\), the average Wald statistics w (for \(\hat{\alpha }_2\)) and \(w^*\) (for \(\hat{\alpha }_2^*\)), and the empirical power to reject the null hypothesis \(\alpha _2 = 0\) or \(\alpha _2^* = 0\) at the 5% level. All results are based on 10,000 simulations with \(n_0 = n_1 = 100\). In every setting of Table 2, the test based on the proposed estimator has greater power than the test based on the conventional average treatment effect.
Table 2
The power for the hypothesis \(\alpha _2=0\) based on \(\hat{\alpha }_2\) and \(\hat{\alpha }_2^*\)
| \(\lambda _{12}\) | \(\lambda _{C}\) | \(p_D\) | \(p_C\) | w | Power \((\alpha _2)\) | \(w^*\) | Power \((\alpha _2^*)\) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0.116 | 0.000 | 0.377 | 0.000 | 2.75 | 0.696 | 2.24 | 0.603 |
| 0.116 | 0.100 | 0.345 | 0.150 | 2.66 | 0.675 | 2.16 | 0.579 |
| 0.116 | 0.200 | 0.316 | 0.276 | 2.58 | 0.656 | 2.09 | 0.551 |
| 0.232 | 0.000 | 0.675 | 0.000 | 3.41 | 0.872 | 2.97 | 0.834 |
| 0.232 | 0.100 | 0.619 | 0.130 | 3.32 | 0.850 | 2.89 | 0.813 |
| 0.232 | 0.200 | 0.570 | 0.239 | 3.24 | 0.838 | 2.80 | 0.796 |
Scenario 3. In this scenario, we investigate the efficiency of the proposed estimator. We consider cases where the event of interest occurs with probability \(p_0=0.10,0.20, \ldots ,0.50\) in the control group (group 0) and \(p_1=0.10,0.20,\ldots , 0.60\) in the treatment group (group 1) with the restriction \(p_0\le p_1\) (i.e., the event occurs at least as often in the treatment group). The time to the event of interest is assumed to follow a log-normal distribution with median 2 in group 0, and medians of 1 or 1.5 in group 1, with a standard deviation of 0.5 in both groups. This corresponds to an accelerated failure time model. The study is analyzed at time \(\tau =2\), such that \(\beta _3=p_1/2\). Figure 2 displays the ratio of \(\alpha _2\) to \(\sqrt{\sigma _0^2+\sigma _1^2}\) (that is, the Z-statistic based on a single observation in each comparison group) and the estimated sample size required to reject the null hypothesis of no group difference with 80% power. The \(\beta\) parameters and \(\sigma _0, \sigma _1\) are estimated from 1,000,000 simulations. The resulting sample size requirements appear feasible for many clinical studies across a wide range of scenarios.
Fig. 2 Expected Z-statistics for a single observation in each comparison group, along with the required sample size to reject the null hypothesis of no group difference with 80% power

4 Data examples

4.1 Example 1: Sample size calculation

The vulvar cancer study (DaVulvaRec) is ongoing and is expected to conclude in 2030. Based on the model assumptions outlined in Simulation Scenario 1, we illustrate a sample size calculation and conduct the final analysis using a simulated dataset with 200 participants per group, each observed over a complete 2-year follow-up. From a clinical standpoint, the monotonicity condition is highly plausible: any recurrence identified under standard follow-up would also be detected under intensive follow-up. Accordingly, the principal stratum is defined as patients who experience recurrence during intensive follow-up.
For the sample size calculation, a simulation of a large sample with 1,000,000 replications in each group yields a causal difference of 5.5 months, with standard deviations of \(\sigma _0=15.4\) months and \(\sigma _1=14.3\) months. Assuming the study aims to demonstrate that patients who experience recurrence under intensive follow-up gain at least 1 additional month compared to standard follow-up, a total of 172 patients per group would be required to achieve 80% power using a one-sided z-test at the 2.5% significance level.
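The calculation can be reproduced with the standard normal-approximation formula for a two-sample comparison with unequal standard deviations, \(n=(z_{1-\alpha }+z_{\text {power}})^2(\sigma _0^2+\sigma _1^2)/(\alpha _2-\delta )^2\), where \(\delta\) is the margin to be exceeded. The sketch below assumes equal allocation and uses the values quoted above; the function name and argument names are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect, margin, sigma0, sigma1,
                          power=0.80, alpha_one_sided=0.025):
    """Patients per group for a one-sided z-test with equal allocation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha_one_sided)
    z_power = NormalDist().inv_cdf(power)
    n = (z_alpha + z_power) ** 2 * (sigma0**2 + sigma1**2) / (effect - margin) ** 2
    return math.ceil(n)


# Values from the text: effect 5.5 months, margin 1 month,
# sigma_0 = 15.4 months and sigma_1 = 14.3 months.
n_per_group = sample_size_per_group(5.5, 1.0, 15.4, 14.3)
```

With these inputs the formula gives 172 patients per group, matching the figure stated above.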
At 2 years, the estimated recurrence rate in the simulated data was 38% (95% CI 32–45%) in the intensive follow-up group and 17% (12–22%) in the standard follow-up group. The average duration of disease recurrence at 2 years was 4.6 months (95% CI 3.6\(-\)5.6) in the intensive follow-up group and 2.0 months (95% CI 1.2\(-\)2.7) in the standard follow-up group, yielding a difference of 2.6 months (95% CI 1.3\(-\)3.9). This implies that among patients who experienced recurrence under intensive follow-up, the estimated additional time spent with disease compared with standard follow-up was 6.8 months (95% CI 4.1\(-\)9.5), favoring the intensive follow-up group.

4.2 Example 2: Colon cancer

The data originate from a randomized clinical trial evaluating adjuvant chemotherapy for colon cancer. Levamisole, a compound with low toxicity originally used to treat parasitic infections in animals, was one of the treatment agents. The second agent, 5-fluorouracil (5-FU), is a moderately toxic chemotherapy drug. Patients were randomly assigned to one of three groups: observation only, levamisole alone (administered orally at 50 mg three times daily for 3 days, repeated every 2 weeks for one year), or a combination of levamisole with 5-FU (450 mg/\(\hbox {m}^2\) administered intravenously for 5 consecutive days, followed by weekly administration starting on day 28 for 48 weeks). The primary outcomes were cancer recurrence and mortality. Baseline covariates were also collected. For additional details, see Moertel et al. (1995). The dataset is available via the survival package in R.
In this analysis, we compare two groups: those assigned to Observation Only (coded as \(Z=0\)) and those assigned to the Lev+5FU treatment (coded as \(Z=1\)). The treatment is intended to reduce the recurrence rate. Since the control group shows a higher recurrence rate, it now defines the principal stratum. This corresponds to reversing the roles of 0 and 1 compared with the vulvar cancer example. The analysis focuses on comparing the average duration of time with cancer recurrence over a 7-year follow-up horizon among patients who would have experienced recurrence under the Observation Only condition. The time point was chosen near the maximum follow-up time. The monotonicity assumption states that if a patient experiences recurrence under Lev+5FU (\(D(1) = 1\)), then the patient would also experience recurrence under Observation Only (\(D(0) = 1\)). The monotonicity assumption appears largely reasonable, and the estimate obtained under it provides a lower bound for the causal effect. Both the plausibility of monotonicity and the degree of deviation, captured by a sensitivity parameter, can be assessed by subject-matter experts (Shepherd et al. 2007). We further illustrate the potential impact of violations of this assumption. Although the recurrence rate may be higher in the Observation Only group than in the Lev+5FU group, the disease duration does not necessarily have to be longer, due to the competing risk of death within the principal stratum. Consequently, the causal effect may not be positive, which is one of the key motivations for investigating the principal stratum.
Fig. 3 State occupation probabilities with 95% confidence intervals
A total of 315 patients were randomized to the Observation Only group and 304 to the Lev+5FU group. The median follow-up duration was 55.7 months, with 38% of patients censored before 7 years of follow-up. The Aalen–Johansen estimate of the state occupation probabilities is shown in Fig. 3. At 7 years, the recurrence rate was estimated at 57% (95% CI 51–62%) in the Observation Only group and 39% (34–45%) in the Lev+5FU group. The risk of death with recurrence was higher in the Observation Only group than in the Lev+5FU group, consistent with the monotonicity assumption (Fig. 3).
The average duration of disease recurrence is the expected length of stay in the Recurrence state, given by the integral of the Recurrence state occupation probability curve (Fig. 3). Over the 7-year horizon, the average duration of disease recurrence is 10.7 months (95% CI 9.0–12.5) among Observation Only patients and 5.9 months (95% CI 4.5–7.3) among treated patients, a difference of 4.8 months (95% CI 2.6–7.0). Under monotonicity, this suggests that among patients who would experience recurrence under Observation Only, the estimated difference in time spent with disease between the Observation Only and treatment groups is 8.5 months (95% CI 4.8–12.1), favoring the treatment group. In a sensitivity analysis with the sensitivity parameter set to \(\gamma =0.90\), the estimated difference increases to 9.5 months (95% CI 6.1–13.0).
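The arithmetic behind these principal-stratum estimates can be sketched from the rounded figures reported above. The sketch assumes the identification formula of Proposition 1 with the roles of 0 and 1 reversed, that is \((\text{E}(T|Z=0)-\gamma \cdot \text{E}(T|Z=1))/\text{P}(D=1|Z=0)\). The function name `principal_stratum_difference` is hypothetical. Because the inputs are rounded, the result for \(\gamma =1\) comes out slightly below the reported 8.5 months, which was presumably computed from unrounded estimates. Confidence intervals are not reproduced here.

```python
def principal_stratum_difference(mean_ref, mean_other, p_ref, gamma=1.0):
    """Point estimate of the principal-stratum difference in mean duration.

    mean_ref   : mean duration in the arm that defines the stratum (months)
    mean_other : mean duration in the other arm (months)
    p_ref      : probability of recurrence in the stratum-defining arm
    gamma      : sensitivity parameter; gamma = 1 corresponds to monotonicity
    """
    return (mean_ref - gamma * mean_other) / p_ref


# Rounded 7-year estimates reported in the text (Observation Only is the reference arm).
mean_obs, mean_trt, p_recurrence_obs = 10.7, 5.9, 0.57

est_monotone = principal_stratum_difference(mean_obs, mean_trt, p_recurrence_obs)
est_gamma_090 = principal_stratum_difference(mean_obs, mean_trt, p_recurrence_obs, gamma=0.90)

print(f"gamma = 1.00: {est_monotone:.2f} months")
print(f"gamma = 0.90: {est_gamma_090:.2f} months")
```

Lowering \(\gamma\) subtracts less of the treated-arm mean, which is why the estimated difference grows in the sensitivity analysis.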

5 Discussion

In clinical studies where an intervention is designed to benefit a specific subgroup of patients, it is important to quantify the causal effect within that subgroup—referred to as a principal stratum. In this paper, we considered the average treatment effect for a duration outcome within the principal stratum of individuals who would experience a positive duration under one treatment. We demonstrated how this causal effect can be identified from observational data under a monotonicity assumption.
Furthermore, we showed that hypothesis testing based on the causal effect in the principal stratum offers greater statistical power than comparisons based on the overall treated and control groups. For censored duration outcomes, we illustrated how the causal effect can be estimated using multi-state models in conjunction with pseudo-observations to handle censoring.
If the conditional exchangeability assumption holds for a set of covariates \(L\), then, as shown in the appendix, the causal effect under monotonicity can be identified as
$$\begin{aligned} \alpha_2= \frac{\text{E}\big(\text{E}(T|L,Z=1)-\gamma(L)\cdot\text{E}(T|L,Z=0)\big)}{\text{E}\big(\text{P}(D=1|L,Z=1)\big)} .\end{aligned}$$
This identification strategy requires regression models for both \(\text {P}(D=1|Z=1,L)\) and \(\text {E}(T|Z,L)\). Alternatively, noting that \(\text{E}(T|L,Z)=\text{E}(T|L,Z,D=1)\cdot\text{P}(D=1|L,Z)\), the causal effect can instead be expressed using models for \(\text {P}(D=1|Z=1,L)\) and \(\text {E}(T|Z,L,D=1)\). As with the suggested multi-state model approach, pseudo-observations can be employed to account for censoring, and the variance of the resulting estimator can be derived following standard techniques (Newey and McFadden 1994). In some settings, this formulation may improve efficiency compared with the estimator presented in this manuscript when the unconditional exchangeability assumption holds. However, there is a trade-off: including models for \(\text {P}(D=1|Z=1,L)\) will generally increase variance, while modeling \(\text {E}(T|Z,L)\) or \(\text {E}(T|Z,L,D=1)\) may reduce it. In Simulation Scenario 1, when including an additional disease severity covariate with "Low" or "High" levels occurring at 50% frequency, applying models for \(\text {P}(D=1|Z=1,L)\) and \(\text {E}(T|Z,L)\) resulted in a slight increase in the variance of the \(\alpha _2\) estimator (data not shown). Nevertheless, this alternative approach and its detailed implementation are beyond the scope of the present paper.
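As a rough illustration of the covariate-adjusted formula for \(\alpha _2\), the following Python sketch computes a plug-in version in the simplest possible setting: uncensored data, a single binary covariate, and stratum-specific sample means in place of regression models. It is not the implementation used in the paper. The function name `alpha2_adjusted`, the simulation design, and all parameter values are assumptions made for illustration only; with censoring, pseudo-observations would be needed instead of raw durations.

```python
import numpy as np


def alpha2_adjusted(t, d, z, l, gamma=None):
    """Plug-in estimate of the covariate-adjusted formula for a discrete covariate l.

    t, d, z, l : arrays of observed duration, disease indicator, treatment, covariate
    gamma      : optional dict mapping covariate level to gamma(L); defaults to 1
                 for every level, which corresponds to monotonicity
    """
    levels = np.unique(l)
    if gamma is None:
        gamma = {lev: 1.0 for lev in levels}
    numerator, denominator = 0.0, 0.0
    for lev in levels:
        weight = np.mean(l == lev)                     # P(L = lev)
        mean_t1 = t[(l == lev) & (z == 1)].mean()      # E(T | L, Z = 1)
        mean_t0 = t[(l == lev) & (z == 0)].mean()      # E(T | L, Z = 0)
        p_d1 = d[(l == lev) & (z == 1)].mean()         # P(D = 1 | L, Z = 1)
        numerator += weight * (mean_t1 - gamma[lev] * mean_t0)
        denominator += weight * p_d1
    return numerator / denominator


# Hypothetical simulation with known potential outcomes (illustrative values only).
rng = np.random.default_rng(1)
n = 200_000
l = rng.integers(0, 2, n)                 # binary covariate, e.g. disease severity
z = rng.integers(0, 2, n)                 # randomised treatment
u = rng.random(n)
d0 = (u < 0.2 + 0.1 * l).astype(int)      # disease under control
d1 = (u < 0.4 + 0.2 * l).astype(int)      # disease under treatment; d0 = 1 implies d1 = 1
t0 = d0 * rng.exponential(6.0, n)         # duration is positive only with disease
t1 = d1 * rng.exponential(10.0, n)
d = np.where(z == 1, d1, d0)              # observed outcomes (consistency)
t = np.where(z == 1, t1, t0)

truth = (t1 - t0)[d1 == 1].mean()         # E(T(1) - T(0) | D(1) = 1), known only in simulation
estimate = alpha2_adjusted(t, d, z, l)
print(f"simulated truth: {truth:.2f}, plug-in estimate: {estimate:.2f}")
```

In this design the analytic value of the causal effect is 7 by construction, so both printed numbers should lie close to it.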

Acknowledgements

We gratefully acknowledge the constructive comments from the referee. In particular, the referee noted that Eq. (1) holds if and only if the condition \(\text {P}(D(1) = 1 | D(0) = 1) = 1\) is satisfied, and suggested that this property can be used to define a sensitivity parameter.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Title
Estimating treatment effects on duration with disease: a principal stratification framework
Author
Erik T. Parner
Publication date
01-03-2026
Publisher
Springer US
Published in
Lifetime Data Analysis / Issue 1/2026
Print ISSN: 1380-7870
Electronic ISSN: 1572-9249
DOI
https://doi.org/10.1007/s10985-025-09681-y

Proof of Proposition 1

Assume the exchangeability assumption holds. First note that exchangeability implies conditional exchangeability given the potential outcome \(D(1)\), since
$$\begin{aligned} \text {P}(T(0),T(1),D(1)) = \text {P}(T(0),T(1), D(1)|Z) \end{aligned}$$
implies
$$\begin{aligned} \text {P}(T(0),T(1)|D(1))\text {P}(D(1)) = \text {P}(T(0),T(1)| D(1),Z)\text {P}(D(1)|Z) \end{aligned}$$
and, because \(\text {P}(D(1)) = \text {P}(D(1)|Z)\) under exchangeability,
$$\begin{aligned} \text {P}(T(0),T(1)|D(1)) = \text {P}(T(0),T(1)|D(1),Z). \end{aligned}$$
Consider now
$$\begin{aligned}&\text {E}(T(1)-T(0)| D(1)=1) \\&=\text {E}(T(1)-T(0)|D(1)=1, Z=1) \text { (exchangeability)} \\&=\text {E}(T-T(0)|D=1, Z=1) \text { (consistency)} \\&=\text {E}(T|D=1, Z=1)-\text {E}(T(0)|D=1, Z=1) \\&=\frac{1}{\text {P}(D=1|Z=1)}\left\{ \text {E}(T1(D=1)|Z=1) -\text {E}(T(0)1(D(1)=1)|Z=1)\right\} \text { (consistency)} \\&=\frac{1}{\text {P}(D=1|Z=1)}\left\{ \text {E}(T|Z=1) -\frac{\text {E}(T(0)1(D(1)=1))}{\text {E}(T(0))}\cdot \text {E}(T(0))\right\} \text { (exchangeability)} \\&=\frac{1}{\text {P}(D=1|Z=1)}\left\{ \text {E}(T|Z=1) -\gamma \cdot \text {E}(T(0)|Z=0) \right\} \text { (exchangeability)} \\&=\frac{1}{\text {P}(D=1|Z=1)}\left\{ \text {E}(T|Z=1) -\gamma \cdot \text {E}(T|Z=0) \right\} \text { (consistency)}, \end{aligned}$$
where \(\gamma\) is defined as
$$\begin{aligned} \gamma&:= \frac{\text {E}(T(0)1(D(1)=1))}{\text {E}(T(0))} \\&=\frac{\text {E}(T(0)|D(1)=1)}{\text {E}(T(0)|D(0)=1)}\cdot \frac{\text {P}(D=1|Z=1)}{\text {P}(D=1|Z=0)}. \end{aligned}$$
Note that \(\gamma \le 1\). Under monotonicity (\(D(0)=1 \Rightarrow D(1)=1\)), we have \(T(0)1(D(1)=1)=T(0)\), which directly implies \(\gamma =1\). More generally, \(\gamma\) can be treated as a sensitivity parameter to assess how violations of monotonicity affect the results.
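The behaviour of \(\gamma\) can be illustrated with simulated potential outcomes, which are of course never jointly observed in practice. The following Python sketch evaluates the defining ratio \(\text {E}(T(0)1(D(1)=1))/\text {E}(T(0))\) once under monotonicity and once when a hypothetical 10% of individuals with \(D(0)=1\) violate it. The function name `gamma_from_potential_outcomes` and all simulation settings are assumptions made for illustration.

```python
import numpy as np


def gamma_from_potential_outcomes(t0, d1):
    """Empirical version of E(T(0) 1(D(1) = 1)) / E(T(0))."""
    return np.sum(t0 * (d1 == 1)) / np.sum(t0)


rng = np.random.default_rng(0)
n = 10_000
u = rng.random(n)
d0 = (u < 0.3).astype(int)                 # disease under control
d1_monotone = (u < 0.5).astype(int)        # d0 = 1 implies d1 = 1
t0 = d0 * rng.exponential(6.0, n)          # T(0) is positive only when D(0) = 1

# Hypothetical violation: about 10% of those with D(0) = 1 have D(1) = 0.
violator = (d0 == 1) & (rng.random(n) < 0.10)
d1_violated = np.where(violator, 0, d1_monotone)

gamma_monotone = gamma_from_potential_outcomes(t0, d1_monotone)
gamma_violated = gamma_from_potential_outcomes(t0, d1_violated)
print(f"gamma under monotonicity: {gamma_monotone:.3f}")
print(f"gamma with violations:    {gamma_violated:.3f}")
```

Since the violators are chosen independently of \(T(0)\) here, \(\gamma\) is roughly one minus the violation fraction; in general it also depends on how long the violators would have spent with disease.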
If, instead, the conditional exchangeability assumption holds for a set of covariates \(L\), then we can express
$$\begin{aligned}&\text{E}(T(1)-T(0)|D(1)=1) \\&= \frac{\text{E}\big(\text{E}(T|L,Z=1)-\gamma(L)\cdot\text{E}(T|L,Z=0)\big)}{\text{E}\big(\text{P}(D=1|L,Z=1)\big)} \\&= \frac{\text{E}\big(\text{E}(T|L,Z=1,D=1)\text{P}(D=1|L,Z=1)-\gamma(L)\cdot\text{E}(T|L,Z=0,D=1)\text{P}(D=1|L,Z=0)\big)}{\text{E}\big(\text{P}(D=1|L,Z=1)\big)}, \end{aligned}$$
where the function \(\gamma (L)\) is defined as
$$\begin{aligned} \gamma (L)&:= \frac{\text {E}(T(0)1(D(1)=1)|L)}{\text {E}(T(0)|L)} . \end{aligned}$$
Under the monotonicity condition, we have \(\gamma (L)=1\).

Asymptotic variance of \(\hat{\alpha }_2\)

The asymptotic variance of \(\hat{\beta }\) is of the form
$$\begin{aligned} \varSigma = \begin{bmatrix} \varSigma _0/n_0 & 0 \\ 0 & \varSigma _1/n_1 \\ \end{bmatrix}. \end{aligned}$$
It follows from the delta method applied to the function \(g(\beta )=\beta _{3}^{-1}(\beta _{4}-\beta _{2})\), with derivative
$$\begin{aligned} g^\prime (\beta )=(0,-\frac{1}{\beta _3},-\frac{1}{\beta _3^2}(\beta _4-\beta _2), \frac{1}{\beta _3}), \end{aligned}$$
that the asymptotic variance of \(\hat{\alpha }_2\) is
$$\begin{aligned} \sigma _0^2/n_0 + \sigma _1^2/n_1&= (0,-\frac{1}{\beta _3}) \varSigma _0 (0,-\frac{1}{\beta _3})^T/n_0 + (-\frac{1}{\beta _3^2}(\beta _4-\beta _2), \frac{1}{\beta _3}) \varSigma _1 (-\frac{1}{\beta _3^2}(\beta _4-\beta _2), \frac{1}{\beta _3})^T/n_1 . \end{aligned}$$
In the uncensored case
$$\begin{aligned} \varSigma _0&= \begin{bmatrix} \beta _{1}(1-\beta _{1}) & \beta _{2}(1-\beta _{1}) \\ \beta _{2}(1-\beta _{1}) & \text {Var}(T|Z=0) \\ \end{bmatrix} \\ \varSigma _1&= \begin{bmatrix} \beta _{3}(1-\beta _{3}) & \beta _{4}(1-\beta _{3}) \\ \beta _{4}(1-\beta _{3}) & \text {Var}(T|Z=1) \\ \end{bmatrix}, \end{aligned}$$
implying
$$\begin{aligned} \sigma _0^2&= \frac{1}{\beta _{3}^2} \text {Var}(T|Z=0) \\ \sigma _1^2&=\frac{1}{\beta _{3}^2}\text {Var}(T|Z=1) -\frac{1}{\beta _{3}^3}(1-\beta _{3})(\beta _{4}^2-\beta _{2}^2) . \end{aligned}$$
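The closed-form expressions for \(\sigma _0^2\) and \(\sigma _1^2\) can be checked numerically against the generic delta-method quadratic form. The Python sketch below assumes the parameter ordering \(\beta =(\text {P}(D=1|Z=0), \text {E}(T|Z=0), \text {P}(D=1|Z=1), \text {E}(T|Z=1))\), which is inferred from the structure of \(\varSigma _0\), \(\varSigma _1\) and \(g\) above. The function names and the numerical inputs are hypothetical and chosen only for illustration.

```python
import numpy as np


def avar_quadratic_form(beta, var_t0, var_t1, n0, n1):
    """Delta-method variance of alpha_2 via g'(beta) Sigma g'(beta)^T (uncensored case)."""
    b1, b2, b3, b4 = beta
    sigma_0 = np.array([[b1 * (1 - b1), b2 * (1 - b1)],
                        [b2 * (1 - b1), var_t0]])
    sigma_1 = np.array([[b3 * (1 - b3), b4 * (1 - b3)],
                        [b4 * (1 - b3), var_t1]])
    grad_0 = np.array([0.0, -1.0 / b3])                 # derivative w.r.t. (beta_1, beta_2)
    grad_1 = np.array([-(b4 - b2) / b3**2, 1.0 / b3])   # derivative w.r.t. (beta_3, beta_4)
    return grad_0 @ sigma_0 @ grad_0 / n0 + grad_1 @ sigma_1 @ grad_1 / n1


def avar_closed_form(beta, var_t0, var_t1, n0, n1):
    """Same variance using the simplified expressions for sigma_0^2 and sigma_1^2."""
    _, b2, b3, b4 = beta
    sigma0_sq = var_t0 / b3**2
    sigma1_sq = var_t1 / b3**2 - (1 - b3) * (b4**2 - b2**2) / b3**3
    return sigma0_sq / n0 + sigma1_sq / n1


# Illustrative values: T = D * Exp(mean m), so E(T) = p*m and Var(T) = 2*p*m^2 - (p*m)^2.
beta = (0.25, 1.5, 0.5, 5.0)      # p0 = 0.25, m0 = 6; p1 = 0.5, m1 = 10
var_t0, var_t1 = 15.75, 75.0
n0 = n1 = 300

v_quadratic = avar_quadratic_form(beta, var_t0, var_t1, n0, n1)
v_closed = avar_closed_form(beta, var_t0, var_t1, n0, n1)
print(v_quadratic, v_closed)
```

Agreement of the two values for arbitrary inputs reflects the algebraic simplification \((\beta _4-\beta _2)^2-2\beta _4(\beta _4-\beta _2)=-(\beta _4^2-\beta _2^2)\) in the \(\varSigma _1\) term.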

References

Aalen OO, Johansen S (1978) An empirical transition matrix for non-homogeneous Markov chains based on censored observations. Scand J Stat 5:141–150
Andersen PK, Klein JP, Rosthøj S (2003) Generalised linear models for correlated pseudo-observations, with applications to multi-state models. Biometrika 90(1):15–27. https://doi.org/10.1093/biomet/90.1.15
Frangakis CE, Rubin DB (2002) Principal stratification in causal inference. Biometrics 58(1):21–29. https://doi.org/10.1111/j.0006-341x.2002.00021.x
Glidden DV (2002) Robust inference for event probabilities with non-Markov event data. Biometrics 58(2):361–368. https://doi.org/10.1111/j.0006-341x.2002.00361.x
Harrell F (2021) Assessing heterogeneity of treatment effect, estimating patient-specific efficacy, and studying variation in odds ratios, risk ratios, and risk differences. https://www.fharrell.com/post/varyor/. Accessed 19 Jun 2024
Hauck WW, Anderson S, Marcus SM (1998) Should we adjust for covariates in nonlinear regression analyses of randomized trials? Control Clin Trials 19(3):249–256. https://doi.org/10.1016/s0197-2456(97)00147-5
Hernán MA, Robins JM (2020) Causal inference: what if. Chapman & Hall/CRC, Boca Raton
ICH E9 (R1) (2020) ICH E9(R1) addendum on estimands and sensitivity analysis in clinical trials to the guideline on statistical principles for clinical trials. EMA/CHMP/ICH/436221/2017, Step 5
Liang KY, Zeger SL (1986) Longitudinal data analysis using generalized linear models. Biometrika 73(1):13–22. https://doi.org/10.1093/biomet/73.1.13
Moertel CG, Fleming TR, Macdonald JS, Haller DG, Laurie JA, Tangen CM, Ungerleider JS, Emerson WA, Tormey DC, Glick JH et al (1995) Fluorouracil plus levamisole as effective adjuvant therapy after resection of stage III colon carcinoma: a final report. Ann Intern Med 122(5):321–326. https://doi.org/10.7326/0003-4819-122-5-199503010-00001
Newey WK, McFadden D (1994) Chapter 36: Large sample estimation and hypothesis testing. Handbook of econometrics, vol 4. Elsevier, Amsterdam. https://doi.org/10.1016/S1573-4412(05)80005-4
Overgaard M, Andersen PK, Parner ET (2023) Pseudo-observations in a multistate setting. Stata J 23(2):491–517. https://doi.org/10.1177/1536867X231175332
Parner ET, Andersen PK, Overgaard M (2023) Regression models for censored time-to-event data using infinitesimal jack-knife pseudo-observations, with applications to left-truncation. Lifetime Data Anal 29(3):654–671. https://doi.org/10.1007/s10985-023-09597-5
Shepherd BE, Gilbert PB, Mehrotra DV (2007) Eliciting a counterfactual sensitivity parameter. Am Stat 61(1):56–63
Zach D, Åvall-Lundqvist E, Falconer H, Hellman K, Johansson H, Rådestad AF (2021) Patterns of recurrence and survival in vulvar cancer: a nationwide population-based study. Gynecol Oncol 161(3):748–754. https://doi.org/10.1016/j.ygyno.2021.03.013