The article examines the integration of high-dimensional randomized controlled trials (RCTs) and real-world data (RWD) to improve the estimation of heterogeneous treatment effects (HTE) in settings that involve censoring and hidden confounding. It introduces an integrative regression approach that simultaneously estimates parameters, selects important variables, and identifies unmeasured confounding effects. The method draws on the strengths of both RCTs and RWD and addresses the limitations of traditional approaches, which often struggle with high-dimensional data and censoring. Through rigorous theoretical analysis and practical application, the article shows the superior efficiency and accuracy of the proposed method for estimating HTE, particularly in complex and realistic medical research settings. The study underlines the potential of integrating different data sources to improve the precision of treatment effect estimation and to pave the way for more personalized and effective medical interventions. The application of the proposed method to real data from a lung cancer study underlines its practical relevance and possible impact on clinical decision-making.
AI-generated
This summary of the content was generated with the help of AI.
Abstract
In this study, we focus on estimating the heterogeneous treatment effect (HTE) for a survival outcome. The outcome is subject to censoring and the covariates are high-dimensional. We utilize data from both a randomized controlled trial (RCT), considered the gold standard, and real-world data (RWD), which may be affected by hidden confounding factors. To achieve a more efficient HTE estimate, such an integrative analysis requires insight into the data generation mechanism, particularly an accurate characterization of the unmeasured confounding effect (bias). With this aim, we propose a penalized-regression-based integrative approach that allows for the simultaneous estimation of parameters, selection of variables, and identification of unmeasured confounding effects. Consistency, asymptotic normality, and efficiency gains are rigorously established for the proposed estimator. Finally, we apply the proposed method to estimate the HTE of lobar/sublobar resection on the survival of lung cancer patients. The RCT is a multicenter non-inferiority randomized phase 3 trial, and the RWD comes from a clinical oncology cancer registry in the United States. The analysis reveals that unmeasured confounding exists and that the integrative approach enhances the efficiency of HTE estimation.
1 Introduction
Recently, there has been a growing focus on the heterogeneity of treatment effect (HTE), a vital path towards personalized medicine (Hamburg and Collins 2010; Collins and Varmus 2015). Accommodating confounding effects is crucial for obtaining well-estimated HTE. In such comparative medical research, it is important but challenging to fully determine what causes confounding effects and measure all of them. The most common approach is to conduct randomized controlled trials (RCTs). RCTs are known as the gold standard for assessing the causal effect of an intervention or treatment on the outcome of interest. The randomization allows the distribution of covariates in different groups to be balanced. However, RCTs have major downsides. For instance, they are costly and time-consuming, and often an inadequate sample size may result from recruitment challenges.
On the other hand, the increasing availability of real-world data (RWD) for research purposes, including electronic health records and disease registries, offers a broader and more diverse population than RCTs. RWD provides abundant additional evidence to support HTE estimation. Under the assumption that the records in RWD contain all the confounders, many approaches to harmonizing evidence from RCTs and RWD for HTE estimation have been developed, ranging from classic methods such as regression-based estimation and inverse probability weighting to more recent machine learning models like neural networks (Shalit et al. 2017) and random forests (Wager and Athey 2018). Inspired by the Robinson transformation (Robinson 1988), Nie and Wager (2021) recently proposed the R-learner to estimate HTE. The R-learner possesses the property of Neyman orthogonality (Neyman 1959), enabling the use of more extensive and flexible machine-learning methods for estimating the nuisance functions. However, in uncontrolled real-world settings it is always possible that important confounders are overlooked or unmeasured. For instance, doctors may assign treatment based on a patient’s symptoms that are not documented in the medical chart. Unmeasured confounding can render the causal effects of interest unidentifiable and result in distorted estimates of HTE.
Classical approaches, such as instrumental variable methods (Angrist et al. 1996), negative controls (Kuroki and Pearl 2014), and sensitivity analysis (Robins et al. 1999), have been proposed to address biases caused by hidden confounding. In recent years, a promising strategy to overcome the challenges posed by hidden confounding is to characterize the confounding function in RWD, and then utilize RCTs to identify both the HTE and confounding function. Drawing upon this idea, Kallus et al. (2018) proposed a regression-based method to estimate HTE. Yang et al. (2020b) established the semiparametric efficient score function to estimate the HTE and confounding function and demonstrated that their method can not only address issues arising from hidden confounding but also enhance the efficiency of HTE estimates. Additionally, they introduced a testing procedure to ascertain the presence of unmeasured confounding, which informs the decision on whether to integrate RWD for a joint analysis (Yang et al. 2023). However, once unmeasured confounding is detected, their approach discards all RWD data. More recently, Wu and Yang (2022) leveraged the benefits of the R-learner to develop an integrative method for estimating the HTE and confounding function, utilizing experimental data for model identification and observational data for efficiency boosting.
However, the approaches mentioned above are all limited to fully observed data and low-dimensional covariates. With the ongoing advancements in data acquisition technology and cloud storage, there is a growing trend towards the collection of high-dimensional data. Censoring frequently occurs in various fields, especially for survival data, where the exact time of the event of interest cannot be observed due to the limited duration of the study. Literature on estimating HTE from high-dimensional or censored data typically assumes the ideal case with no hidden confounding. Ma and Zhou (2018) proposed characterizing the hazard ratio to mimic the heterogeneous treatment effect, yet they did not take into account high-dimensional covariates. Zhu and Gallego (2020) used the difference in survival functions to describe heterogeneous treatment effects. Hu et al. (2021) utilized the difference in survival quantile to characterize the survival treatment effect at the individual level and adopted a machine learning approach for model estimation. Zhou and Zhu (2021) applied the sufficient dimension reduction technique to high-dimensional data without censoring. To our knowledge, the literature on estimating HTE from high-dimensional censored data while offsetting the unmeasured confounding effect remains scarce.
In this paper, we focus on improving the estimate of HTE for a survival outcome by integrating high-dimensional censored RCT and RWD data, particularly in situations where unmeasured confounding may exist. We propose an integrative regression approach to simultaneously estimate parameters, select important variables, and determine the presence of unmeasured confounding effects. The proposed method assumes the transportability of the HTE. Therefore, the RCT can be utilized to identify the HTE in RWD. Both the HTE and the confounding function can be estimated through regularized weighted least squares regression to accommodate censoring. The proposed method possesses the property of Neyman orthogonality, making it possible to adopt flexible machine-learning methods for the estimation of the nuisance functions. Theoretical properties are rigorously established, including estimation consistency, variable selection consistency, and asymptotic normality. We demonstrate that the proposed integrative method results in an HTE estimate that is at least as efficient as the estimate based solely on RCT data. When there is unmeasured confounding, instead of excluding all data from RWD, the proposed method can still make use of the RWD in some cases. This study has the potential to enhance the existing literature in multiple important aspects. First, an integrative analysis that includes high-dimensional censored RWD in HTE estimation is conducted, which can be more challenging than analyzing low-dimensional, completely observed data. Secondly, the proposed approach permits the presence of unmeasured confounding, which is more flexible and complements analyses that assume unconfoundedness in RWD. Thirdly, the proposed approach can identify whether an unmeasured confounding effect exists in a fully data-driven manner. This can contribute to more accurate estimates and lead to a deeper understanding of the data generation mechanism.
Lastly, and equally importantly, this study offers a valuable practical tool for addressing a wide range of scientific issues. In particular, we apply the proposed integrative approach to improve the estimate of HTE on overall survival for patients with early-stage non-small-cell lung cancer undergoing lobar resection and limited resection, which convincingly demonstrates the usefulness of the proposed method.
The remaining part of the paper is organized as follows. In Sect. 2, we introduce the proposed method. Theoretical properties are provided in Sect. 3. Numerical studies are conducted in Sect. 4, and application to real data is presented in Sect. 5. Concluding remarks are given in Sect. 6. Technical details are given in the Appendix.
2 Methods
Let \(\widetilde{T}\) be the failure time, \(C\) be the censoring time, \(T=\min (\widetilde{T},C)\) be the observed time with censoring indicator \(\delta =I(\widetilde{T}\le C)\), and \(A\in \left\{ 0,1\right\}\) be the binary treatment variable. Let \({\textbf {X}}\in \mathbb {R}^p\) be the covariate vector, which includes the intercept term \(X_0\equiv 1\). Let \(S\) denote the data source, taking the value 0 for RWD and 1 for RCT. The sample size is \(n_1\) for the RCT and \(n_0\) for the RWD. Let the observed data be \(\mathcal{O}\mathcal{B}=\left\{ \mathcal{O}\mathcal{B}_i,i=1,2,...,n=n_1+n_0\right\}\), where \(\mathcal{O}\mathcal{B}_i=(T_i,\delta _i,{\textbf {X}}_i,A_i,S_i).\) Under the potential outcome framework, let \(\widetilde{T}(a)\), \(C(a)\) and \(T(a)=\min \left\{ \widetilde{T}(a),C(a)\right\}\) denote the potential failure time, potential censoring time and potential observed time under treatment \(a \in \left\{ 0,1\right\}\), respectively. We aim to evaluate the heterogeneous treatment effect (HTE) defined as follows
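For concreteness, and as an assumption inferred from the log-scale regression functions \(\mu _a\) introduced later in this section rather than a quotation of the original display, the HTE can be read as the conditional mean difference in log failure time between the two treatments:

$$\begin{aligned} \tau ({\textbf {X}})=\mathbb {E}\left\{ \log \widetilde{T}(1)-\log \widetilde{T}(0)|{\textbf {X}}\right\} . \end{aligned}$$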
The definition of the HTE aligns seamlessly with conventional survival models, as illustrated, e.g., in (1) and (2). The basic assumptions for modelling are as follows:
\({\textbf {(A0)}}\)
(i) \(\widetilde{T}=A\widetilde{T}(1)+(1-A)\widetilde{T}(0)\), \(C=AC(1)+(1-A)C(0)\), and \(T=AT(1)+(1-A)T(0)\).
(ii) \(\widetilde{T}(a)\perp A|({\textbf {X}},S=1)\), \(a\in \left\{ 0,1\right\}\).
Assumption (i) states consistency between the observed and the potential outcomes. Assumption (ii) holds for the RCT by design. Assumption (iii) states that the HTE is the same for the trial participants and the patient population at large. It holds if trial participants are randomly recruited within each subgroup of \({\textbf {X}}\), or if the exclusion criteria for trial participation do not affect the treatment response.
Define \(\mu _a({\textbf {X}},S)=\mathbb {E}\left\{ \log (\widetilde{T})|A=a,S,{\textbf {X}}\right\}\), \(a=0,1\). By assumption, it can be seen that for RCT, \(\mu _1({\textbf {X}},S=1)-\mu _0({\textbf {X}},S=1)=\tau ({\textbf {X}})\). However, this equation may not hold in RWD if unmeasured confounding exists. Define the confounding function \(u_c({\textbf {X}})=\mu _1({\textbf {X}},S=0)-\mu _0({\textbf {X}},S=0)-\tau ({\textbf {X}})\). It can be seen that \(u_c({\textbf {X}})\) captures the unmeasured confounding effect. The above formulations can be summarized into
where \(\mathbb {E}(\epsilon |{\textbf {X}},A,S)=0\), and \(\mathbb {E}(\epsilon ^2|{\textbf {X}},A,S)\) is finite. Taking the expectation conditional on \(({\textbf {X}},S)\) on both sides of this model leads to
where \(\mu ({\textbf {X}},S)=\mathbb {E}\left\{ \log (\widetilde{T})|S,{\textbf {X}}\right\}\), \(\mathbb {E}(\epsilon |{\textbf {X}},A,S)=0\), and \(\mathbb {E}(\epsilon ^2|{\textbf {X}},A,S)<\infty\). Based on assumption \({\textbf {(A0)}}\), the induced formulation (3) is an accelerated failure time (AFT) model. The AFT model is a natural choice for clinical decision-making because it has an intuitive regression interpretation in terms of failure time. There is a rich literature considering the AFT model for observational studies (Henderson et al. 2020; Hu et al. 2021; Simoneau et al. 2020; Yang et al. 2020a). Estimation of the AFT model with an unspecified error distribution has been studied extensively. Here, we adopt the weighted least squares (LS) approach (Stute 1993), which is computationally more feasible.
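The step from the outcome model to (3) can be sketched as follows. The symbol \(e({\textbf {X}},S)=\Pr (A=1|{\textbf {X}},S)\) for the propensity score and the exact layout of the displays are assumptions made here; the algebra uses only the definitions of \(\mu _a\), \(\tau\) and \(u_c\) given above. The first line is the outcome model, the second its expectation conditional on \(({\textbf {X}},S)\), and the third their difference:

$$\begin{aligned} \log \widetilde{T}&=\mu _0({\textbf {X}},S)+A\left\{ \tau ({\textbf {X}})+(1-S)u_c({\textbf {X}})\right\} +\epsilon ,\\ \mu ({\textbf {X}},S)&=\mu _0({\textbf {X}},S)+e({\textbf {X}},S)\left\{ \tau ({\textbf {X}})+(1-S)u_c({\textbf {X}})\right\} ,\\ \log \widetilde{T}-\mu ({\textbf {X}},S)&=\left\{ A-e({\textbf {X}},S)\right\} \left\{ \tau ({\textbf {X}})+(1-S)u_c({\textbf {X}})\right\} +\epsilon . \end{aligned}$$

Subtracting the second line from the first removes \(\mu _0\), leaving a residual-on-residual regression in the spirit of the R-learner.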
Remark 2
More generally, instead of a logarithmic transformation on failure time, any other known monotone transformation can be considered. Then, the definition of HTE and assumption \({\textbf {(A0)}}\) should be correspondingly modified.
In (3), we aim to estimate the HTE \(\tau\) and the confounding function \(u_c\), with \(e\) and \(\mu\) being the nuisance functions. First, we make the following assumptions for modelling heterogeneous treatment effects and unmeasured confounding effects.
\({\textbf {(M0)}}\)
\(u_c({\textbf {X}})\) can be modelled by \({\textbf {X}}^{\textrm{T}}\varvec{\beta }\), and \(\tau ({\textbf {X}})\) can be modelled by \({\textbf {X}}^{\textrm{T}}\varvec{\alpha }\), where \(\varvec{\beta }\), \(\varvec{\alpha }\in \mathbb {R}^p\).
Let the parameters of interest be \(\varvec{\theta }=(\varvec{\alpha }^{\textrm{T}},\varvec{\beta }^{\textrm{T}})^{\textrm{T}}\) and the nuisance functions be \(\eta =(e,\mu )\). Let \({\textbf {Z}}=({\textbf {X}},S)\), and \({\textbf {U}}=({\textbf {X}},(1-S){\textbf {X}})\). Then, the weighted loss function is
where \(T_{(1)}\le T_{(2)}\le ...\le T_{(n)}\), \({\textbf {Z}}_{(i)}\) and \({\textbf {U}}_{(i)}\) are ordered correspondingly, and \(w_{i}\) is defined as follows
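As a minimal sketch of the weighted least squares construction described above, the following Python code computes the standard Kaplan-Meier weights of Stute (1993) and a weighted squared-error loss for the residual model. The function names `stute_weights` and `weighted_loss`, the tie-breaking rule, the absence of a normalizing constant, and the representation of the pre-estimated nuisance functions as arrays `e_hat` and `mu_hat` are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def stute_weights(time, delta):
    """Kaplan-Meier (Stute) weights, returned in the order of sorted times.

    w_(1) = delta_(1) / n,
    w_(i) = delta_(i) / (n - i + 1) * prod_{j < i} ((n - j) / (n - j + 1)) ** delta_(j).
    """
    time = np.asarray(time, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = time.shape[0]
    # Sort by time; for ties, uncensored observations come first (assumed convention).
    order = np.lexsort((1 - delta, time))
    d = delta[order]
    ranks = np.arange(1, n + 1)
    factors = ((n - ranks) / (n - ranks + 1.0)) ** d
    # Product over j < i: shift the cumulative product by one position.
    prod_before = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    weights = d / (n - ranks + 1.0) * prod_before
    return weights, order


def weighted_loss(theta, time, delta, X, S, A, e_hat, mu_hat):
    """Weighted squared residuals of log T - mu - (A - e) * U^T theta.

    theta stacks (alpha, beta); U = (X, (1 - S) X). No 1/2 or 1/n factor is
    applied (an assumption; the original normalization is not shown).
    """
    time = np.asarray(time, dtype=float)
    X = np.asarray(X, dtype=float)
    S = np.asarray(S, dtype=float)
    A = np.asarray(A, dtype=float)
    weights, order = stute_weights(time, delta)
    U = np.hstack([X, (1.0 - S)[:, None] * X])
    resid = np.log(time) - mu_hat - (A - e_hat) * (U @ theta)
    return float(np.sum(weights * resid[order] ** 2))
```

Censored observations receive weight zero, and their mass is redistributed to later uncensored observations; without censoring every weight equals \(1/n\) and the loss reduces to ordinary least squares.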
Suppose that the nuisance function \(\eta\) can be pre-estimated, then we propose to use the following penalized regression to get the estimate. It can simultaneously select important variables and determine whether the unmeasured confounding exists:
where \(\rho (t;\lambda )\) is a penalty function with tuning parameter \(\lambda >0\) that induces sparsity. Various penalty functions can be used to derive sparse and unbiased estimates, such as the adaptive Lasso (Zou 2006), SCAD (Fan and Peng 2004), and MCP (Zhang 2010). The penalty in (5) consists of two parts with tuning parameters \(\lambda _1\) and \(\lambda _2\), respectively. The first part applies to the HTE parameter \(\varvec{\alpha }\), and the second part applies to the parameter of the unmeasured confounding function, \(\varvec{\beta }\). Penalizing the two parts separately accommodates the more general case where the magnitudes of the coefficients in the HTE and in the confounding function differ. A similar strategy can be found in Cheng et al. (2023). The final estimate can be written as
where the tuning parameters \(\lambda _1\) and \(\lambda _2\) can be selected by criteria such as AIC, BIC, and cross-validation (CV).
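Written out, the penalized criterion described above takes the following generic shape. The symbols \(L_n\) for the weighted loss and \(Q_n\) for the penalized objective, the absence of a normalizing constant, and the treatment of the intercept are assumptions made for illustration:

$$\begin{aligned} Q_n(\varvec{\theta };\widehat{\eta })&=L_n(\varvec{\theta };\widehat{\eta })+\sum _{j=1}^{p}\rho (|\alpha _j|;\lambda _1)+\sum _{j=1}^{p}\rho (|\beta _j|;\lambda _2),\\ \widehat{\varvec{\theta }}&=\arg \min _{\varvec{\theta }\in \mathbb {R}^{2p}}Q_n(\varvec{\theta };\widehat{\eta }). \end{aligned}$$

Since \(u_c({\textbf {X}})={\textbf {X}}^{\textrm{T}}\varvec{\beta }\) under \({\textbf {(M0)}}\), an estimate with all components of \(\varvec{\beta }\) shrunk to zero corresponds to the case where no unmeasured confounding is detected.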
3 Theoretical properties
Denote the true parameters by \(\varvec{\theta }^*=(\varvec{\alpha }^{*{\textrm{T}}},\varvec{\beta }^{*{\textrm{T}}})^{\textrm{T}}\) and the true nuisance functions by \(\eta ^*\). Define the index sets of non-zero parameters as follows: \(\mathcal {D}=\left\{ 1\le j\le 2p|\theta ^*_j\ne 0\right\}\) with cardinality \(d_{n}\), \(\mathcal {D}_1=\left\{ 1\le j\le p|\alpha ^*_j\ne 0\right\}\) with cardinality \(d_{1n}\), \(\mathcal {D}_2=\left\{ 1\le j\le p|\beta ^*_j\ne 0\right\}\) with cardinality \(d_{2n}\). Following the notation in Stute (1996), let \(G\) be the distribution function (d.f.) of \(C\), with \(\tau _{G}=\inf \left\{ x:G(x)=1\right\}\), \(F\) be the d.f. of \(\widetilde{T}\), with \(\tau _{F}=\inf \left\{ x:F(x)=1\right\}\), and \(H\) be the d.f. of \(T\), with \(\tau _H=\inf \left\{ x:H(x)=1\right\}\). Let \(F^0\in \mathcal {P}\) be the d.f. of \((\widetilde{{\textbf {Z}}},\widetilde{T})\), where \(\widetilde{{\textbf {Z}}}=({\textbf {Z}},A)\). Define
where \(j=1,2,...,2p\). It can be seen that \(\mathbb {E}\varvec{\phi }(\widetilde{{\textbf {Z}}},\widetilde{T};\varvec{\theta }^*,\eta ^*)={\textbf {0}}\), where \(\varvec{\phi }=\left( \phi _j,j=1,2,...,2p\right)\). Define \(\gamma _0(y)=\exp \left\{ \int ^{y-}_0\left\{ 1-H(v)\right\} ^{-1}\widetilde{H}^0(dv)\right\}\), where \(\widetilde{H}^{0}(y)=\Pr (T\le y, \delta =0)\).
(B1)
(i) The eigenvalues of \(\mathbb {E}\left[ \left\{ A-e^*({\textbf {Z}})\right\} ^2{\textbf {U}}{\textbf {U}}^{\textrm{T}}\right]\) are larger than a positive constant \(c_1\).
(ii) The eigenvalues of \(\mathbb {E}{\textbf {U}}{\textbf {U}}^{\textrm{T}}\) are smaller than a positive constant \(c_2\).
(B2)
The penalty function satisfies the following properties.
(i) \(\rho (x;\lambda )\) is nondecreasing in \(x\in [0,\infty )\) and \(\rho (0;\lambda )=0\).
(ii) Let \(\dot{\rho }(x;\lambda )=\partial \rho (x;\lambda )/\partial x\). It exists and is bounded in \(x\in (0,\infty )\). In addition, \(\dot{\rho }(x;\lambda )/\lambda >0\), as \(x\rightarrow 0+\), \(n\rightarrow \infty\), and \(|\dot{\rho }(x_1;\lambda )-\dot{\rho }(x_2;\lambda )|\le O(1)\lambda |x_1-x_2|\), for \(x_1,x_2\in (0,\infty )\).
(iii) Let \(\ddot{\rho }(x;\lambda )=\partial ^2\rho (x;\lambda )/\partial x^2\). It exists and is bounded in \(x\in (\gamma _1\lambda ,\infty )\), where \(\gamma _1>0\) is a constant. It holds that \(|\ddot{\rho }(x_1;\lambda )-\ddot{\rho }(x_2;\lambda )|\le O(1)|x_1-x_2|\), for \(x_1,x_2\in (\gamma _1\lambda ,\infty )\).
(B3)
The pre-estimated nuisance parameter \(\widehat{\eta }\) is independent of the samples used to build the loss function. With a slight abuse of notation, we continue to use \(n\) to denote the sample size for constructing the loss function and assume that the pre-estimated nuisance parameter \(\widehat{\eta }\) is obtained from another sample of size \(r_\eta n\), where \(r_\eta n\) is bounded by a positive constant. Let \(R_{e,n}\) be the convergence rate of \(\Vert \widehat{e}-e^*\Vert _\infty\) and \(R_{\mu ,n}\) be the convergence rate of \(\Vert \widehat{\mu }-\mu ^*\Vert _\infty\). Define the rate \(R_{n}=\max \left\{ R_{\mu ,n},R_{e,n}\Vert \varvec{\theta }^*\Vert _2,\sqrt{p/n}\right\}\). The true parameter satisfies \(\min _{j\in \mathcal {D}_1}|\alpha _j^*|/\lambda _1\rightarrow \infty\), \(\min _{j\in \mathcal {D}_2}|\beta _j^*|/\lambda _2\rightarrow \infty\), as \(n\rightarrow \infty\). Additional conditions are listed as follows.
(iv) \(\sqrt{n}R_{n}^2=o(1)\), and \(\sqrt{n}\max \left\{ \lambda _1,\lambda _2\right\} R_{n}=o(1)\).
(v) \(\mathbb {E}|\phi _j(\widetilde{{\textbf {Z}}},\widetilde{T};\varvec{\theta },\eta )-\phi _j(\widetilde{{\textbf {Z}}},\widetilde{T};\varvec{\theta }^*,\eta ^*)|^2\le \left( \Vert \varvec{\theta }-\varvec{\theta }^*\Vert _2\vee \Vert \eta -\eta ^*\Vert _\infty \right) ^bc_3\), \(j\in \mathcal {D}\), where b and \(c_3\) are positive constants. In addition, \(\sqrt{d_{n}}R_{n}^{b/2}=o(1)\), \(\sqrt{d_{n}}n^{-1/2+1/q}=o(1)\), \(q>2\).
(B4)
In what follows, we use \(\Vert \cdot \Vert _{Q,q}\) to denote the \(L^q(Q)\) norm. The uniform entropy numbers for set \(\mathcal {F}\) with radius \(\xi >0\) under \(L^q(Q)\) norm are defined as \(\sup _Q \log N(\xi , \mathcal {F}, \Vert \cdot \Vert _{Q,q})\), where \(N(\xi , \mathcal {F}, \Vert \cdot \Vert _{Q,q})\) is the corresponding covering number. Let \(\varTheta =\left\{ \varvec{\theta }:\Vert \varvec{\theta }-\varvec{\theta }^*\Vert _2\le R_{n}c_4\right\}\), where \(c_4\) is a positive constant. Define class
with measurable envelope \(F_{1,\eta }\). It satisfies \(\Vert F_{1,\eta }\Vert _{F^0,q}\le c_5\) where \(c_5\) is a positive constant and \(F^0\in \mathcal {P}\). It holds that for all \(0<\xi \le 1\), the uniform entropy number of \(\mathcal {F}_{1,\eta }\) obeys
\({\textbf {(B0)}}\) guarantees that Stute’s empirical probability measure converges to \(F^0\) (Stute 1993). In addition, it ensures the asymptotic normality of \(\sum _{i=1}^nw_{i}\phi _j(\widetilde{{\textbf {Z}}}_{(i)},T_{(i)})\) given the true nuisance functions (Stute 1996). \({\textbf {(B0)}}\)(i) assumes that the censoring variable is conditionally independent of \(({\textbf {X}},A,S)\) given the failure time \(\widetilde{T}\). By contrast, inverse probability of censoring weighting (IPCW) of the form \(\delta /\Pr (C>T|{\textbf {X}},A,S,T)\) often requires \(C\perp \widetilde{T}|S,A,{\textbf {X}}\) instead. While Stute’s weights are fully non-parametric, using IPCW requires estimating the survival function of the censoring time, \(\Pr (C>t|{\textbf {X}},A,S)\), which may introduce additional model assumptions. We discuss further development based on IPCW in Sect. 6. \({\textbf {(B1)}}\) puts constraints on the eigenvalues of the design matrices. \({\textbf {(B2)}}\) states the basic properties of the penalty function. Many penalty functions, such as SCAD and MCP, satisfy these properties. \({\textbf {(B3)}}\) contains several assumptions on the nuisance functions, the penalty function and the convergence rate. It should be noted that the independence of the pre-estimated nuisance parameter can be achieved by data splitting. \({\textbf {(B3)}}\)(i)-(iii) naturally hold when the signal of the true parameter is strong enough. \({\textbf {(B3)}}\)(iv) requires the convergence rate of the nuisance function estimate to be faster than \(n^{-1/4}\), and \(p=o(n^{1/2})\). \({\textbf {(B4)}}\) is used to meet the condition of Lemma 6.2 in Chernozhukov et al. (2018).
Theorem 1
(Consistency) If \({\textbf {(M0)}}\), \({\textbf {(A0)}}\), \({\textbf {(B0)}}\), \({\textbf {(B1)}}\), \({\textbf {(B2)}}\) and \({\textbf {(B3)}}\) (i)(ii) hold, then \(\Vert \widehat{\varvec{\theta }}-\varvec{\theta }^*\Vert _2=O_p(R_{n})\), where \(R_{n}\) is defined in (B3).
Theorem 2
(Sparsity recovery) Suppose the result in Theorem 1 holds. If \(\min \left\{ \lambda _1,\lambda _2\right\} R_{n}^{-1}\rightarrow \infty\) as \(n\rightarrow \infty\), then \(\Pr \left( \widehat{\varvec{\theta }}_{\mathcal {D}^c}={\textbf {0}}\right) \rightarrow 1\).
To derive the asymptotic normality of the proposed estimator, we introduce some notations first. Let \(\widetilde{H}^{1}({\textbf {z}},y)=\Pr (\widetilde{{\textbf {Z}}}\le {\textbf {z}}, T\le y, \delta =1)\),
Theorem 3
(Asymptotic normality) Suppose the results of consistency and sparsity recovery hold. Assume that \({\textbf {(M0)}}\), \({\textbf {(A0)}}\), \({\textbf {(B0)}}\)-\({\textbf {(B2)}}\) and \({\textbf {(B3)}}\) (iii)(iv)(v) hold. For any \({\textbf {q}}\in \mathbb {R}^{d_{n}}\), \(\Vert {\textbf {q}}\Vert _2<\infty\), if \(\sigma ^2{\textbf {q}}^{\textrm{T}}{\textbf {V}}{\textbf {q}}\rightarrow \sigma _*^2\) as \(n\rightarrow \infty\), then
We need additional conditions to derive the asymptotic properties of the RCT-only estimator. Since the conditions are similar to \({\textbf {(B0)}}\)-\({\textbf {(B4)}}\), details are presented in the Appendix. Define rate \(R_{n_1}=\max \left\{ R_{\mu ,n_1},R_{e,n_1}\Vert \varvec{\alpha }^*\Vert _2,\sqrt{p/n_1}\right\}\), and \({\textbf {V}}_r={\textbf {B}}_r^{-1}\varvec{\varSigma }_{r\mathcal {D}_1}{\textbf {B}}_r^{-1}\), where
The definitions of \(\varvec{\varphi }\), \(\gamma _{r0}\), \(\varvec{\gamma }_{r1}\), \(\varvec{\gamma }_{r2}\) are given in the Appendix.
Theorem 4
(Asymptotic normality for RCT-only estimator) Suppose the results of consistency with rate \(R_{n_1}\) and sparsity recovery hold. Assume that \({\textbf {(M0)}}\), \({\textbf {(A0)}}\), \({\textbf {(C0)}}\), \({\textbf {(C1)}}\) and \({\textbf {(C2)}}\) (in the Appendix) hold. For any \({\textbf {q}}\in \mathbb {R}^{d_{1n}}\), \(\Vert {\textbf {q}}\Vert _2<\infty\), if \(\sigma ^2{\textbf {q}}^{\textrm{T}}{\textbf {V}}_r{\textbf {q}}\rightarrow \sigma _{r*}^2\) as \(n_1\rightarrow \infty\), then
(Efficiency gain) Suppose the results in Theorems 3 and 4 hold. If there is no censoring, it can be seen that \(\varvec{\varSigma }_{r\mathcal {D}_1}={\textbf {B}}_r\), and \(\varvec{\varSigma }_{\mathcal {D}}={\textbf {B}}\). For any \({\textbf {q}}\in \mathbb {R}^{d_{1n}}\), \(\Vert {\textbf {q}}\Vert _2<\infty\), with probability converging to 1, we have
$$\begin{aligned} \text{ Var }(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}_{\mathcal {D}_1})\le \text{ Var }(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}^{rct}_{\mathcal {D}_1}), \end{aligned}$$
(7)
where the equality holds if and only if there exists a \(d_{2n}\times d_{1n}\) constant matrix \({\textbf {Q}}\) such that, when \(S=0\), \({\textbf {X}}_{\mathcal {D}_1}={\textbf {Q}}^{\textrm{T}}{\textbf {X}}_{\mathcal {D}_2}\). In particular, when \(\mathcal {D}_1\subset \mathcal {D}_2\), the equality holds. When \({\mathcal {D}_2}=\emptyset\), under (B1)(i), the inequality in (7) holds strictly.
Remark 4
Censoring leads to a more complicated form of variance, thus it is difficult to see the efficiency gain directly. Let \({\textbf {B}}=({\textbf {B}}_{11},{\textbf {B}}_{12};{\textbf {B}}_{21},{\textbf {B}}_{22})\) where \({\textbf {B}}_{11}=\mathbb {E}\left\{ A-e^*({\textbf {Z}})\right\} ^2{\textbf {X}}_{\mathcal {D}_1}{\textbf {X}}_{\mathcal {D}_1}^{\textrm{T}}\), \({\textbf {B}}_{12}=\mathbb {E}(1-S)\left\{ A-e^*({\textbf {Z}})\right\} ^2{\textbf {X}}_{\mathcal {D}_1}{\textbf {X}}_{\mathcal {D}_2}^{\textrm{T}}\), \({\textbf {B}}_{22}=\mathbb {E}(1-S)\left\{ A-e^*({\textbf {Z}})\right\} ^2{\textbf {X}}_{\mathcal {D}_2}{\textbf {X}}_{\mathcal {D}_2}^{\textrm{T}}\). Define \(\varvec{\varOmega }_{11}=\left( {\textbf {B}}_{11}-{\textbf {B}}_{12}{\textbf {B}}_{22}^{-1}{\textbf {B}}_{12}^{\textrm{T}}\right) ^{-1}\). Let \(\varvec{\varSigma }_{11}\) be the submatrix of \(\varvec{\varSigma }\) with columns and rows corresponding to \(\varvec{\alpha }^*_{\mathcal {D}_1}\), \(\varvec{\varSigma }_{12}\) be the submatrix with columns corresponding to \(\varvec{\alpha }^*_{\mathcal {D}_1}\) and rows corresponding to \(\varvec{\beta }^*_{\mathcal {D}_2}\), \(\varvec{\varSigma }_{22}\) be the submatrix with columns and rows corresponding to \(\varvec{\beta }^*_{\mathcal {D}_2}\). Let \(r_{_S}=\Pr (S=1)\), and \(\varDelta \varvec{\varSigma }_{11}= \varvec{\varSigma }_{11}-{\textbf {B}}_{12}{\textbf {B}}_{22}^{-1}\varvec{\varSigma }_{12}^{\textrm{T}}-\varvec{\varSigma }_{12}{\textbf {B}}_{22}^{-1}{\textbf {B}}_{12}^{\textrm{T}}+{\textbf {B}}_{12}{\textbf {B}}_{22}^{-1}\varvec{\varSigma }_{22}{\textbf {B}}_{22}^{-1}{\textbf {B}}_{12}^{\textrm{T}}\). Generally, if
that is, the matrix is semi-definite, then the variance of the proposed estimate \(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}_{\mathcal {D}_1}\) will not be larger than that of the RCT-only estimate \(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}_{\mathcal {D}_1}^{rct}\). In particular, when \(\mathcal {D}_2=\emptyset\), \({\textbf {B}}_{11}={\textbf {B}}_r\) and \(\varvec{\varSigma }_{11}=\varvec{\varSigma }_r\), i.e., the distributions of \((A,{\textbf {X}})\) and censoring in RCT and RWD are similar, the variance of the proposed estimate \(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}_{\mathcal {D}_1}\) will be strictly smaller than that of the RCT-only estimate \(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}_{\mathcal {D}_1}^{rct}\).
Remark 5
(Variance estimation) The theoretical variances obtained in Theorems 3 and 4 are not easy to estimate based on the formulations. Following Huang et al. (2006), we estimate the variance using the nonparametric 0.632 bootstrap (Efron and Tibshirani 1993), in which approximately \(0.632n\) samples from the \(n\) observations are randomly selected without replacement.
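The resampling scheme in this remark can be sketched in Python as follows. The helper name `subsample_variance`, its arguments, and the default number of resamples are hypothetical, and the sketch simply returns the empirical covariance of the estimates across subsamples; whether any rescaling for the subsample size is applied in the original analysis is not stated and is therefore left out.

```python
import numpy as np


def subsample_variance(estimator, data, n_boot=200, frac=0.632, seed=0):
    """Empirical covariance of `estimator` over subsamples drawn without replacement.

    `estimator` maps an array of observations (rows) to a scalar or a 1-d array.
    Each resample keeps round(frac * n) distinct rows of `data`.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    m = int(round(frac * n))
    estimates = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=m, replace=False)  # distinct rows only
        estimates.append(np.atleast_1d(estimator(data[idx])))
    estimates = np.vstack(estimates)                # shape (n_boot, k)
    return np.atleast_2d(np.cov(estimates, rowvar=False, ddof=1))
```

In use, `estimator` would refit the penalized regression on each subsample and return the coefficients of interest; the diagonal of the returned matrix then gives the resampling variances.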
4 Simulation
We conduct simulation studies to evaluate the performance of the proposed method including efficiency gain (compared with the estimators that only use RCT data), parameters estimation, variable selection and identification of unmeasured confounding. The data is generated from the following model
where \(\mu _0({\textbf {X}},S)=\sin (X_1)+0.2X_4^2-0.5{\textbf {X}}^{\textrm{T}}\varvec{\alpha }^*-0.5(1-S){\textbf {X}}^{\textrm{T}}\varvec{\beta }^*\). Here \({\textbf {X}}\) is observable, while \(u\) is the unmeasured confounding effect. We generate \(n=2500\) samples from this model with the following distributions: \(S\sim Bernoulli(0.2)\), \(A\sim Bernoulli(0.5)\), \({\textbf {X}}|A\sim N(0.2A\times ({\textbf {1}}_{8},{\textbf {0}}_{p-8}),\varvec{\varSigma })\), \(u|A\sim N(A{\textbf {X}}^{\textrm{T}}\varvec{\beta }^*,\varvec{\varSigma })\), and \(\epsilon \sim N(0,1)\), where \(\varvec{\varSigma }=(0.3^{|i-j|},i,j=1,2,...,p)\). Let \(\varvec{\alpha }^*=Signal\times ({\textbf {1}}_{4},-{\textbf {1}}_{4},{\textbf {0}}_{p-8})\), \(\varvec{\beta }^*=Signal\times ({\textbf {1}}_{2},-{\textbf {1}}_{2},{\textbf {0}}_{p-4})\), \(Signal=2\), when an unmeasured confounding effect exists; otherwise \(\varvec{\beta }^*={\textbf {0}}_{p}\). The dimension of \({\textbf {X}}\) is \(p\in \left\{ 20,50\right\}\). The censoring time satisfies \(\log C\sim \text {Unif}[t_0,t_1]\), where \(t_0\) and \(t_1\) are chosen so that the censoring rate is around 20% or 40%. We adopt the MCP function as the penalty function, i.e., \(\rho (t;\lambda )=\lambda \int _{0}^{t}\big (1-x/(\gamma \lambda )\big )_+dx\).
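Some ingredients of this design can be written down directly. The sketch below gives the closed form of the MCP integral, the covariance matrix \(\varvec{\varSigma }\), and the coefficient vectors \(\varvec{\alpha }^*\) and \(\varvec{\beta }^*\). The function names are hypothetical, and the default \(\gamma =3\) is an assumption, since the value of \(\gamma\) is not stated in the text.

```python
import numpy as np


def mcp_penalty(t, lam, gamma=3.0):
    """Closed form of lam * int_0^|t| (1 - x / (gamma * lam))_+ dx.

    Equals lam*|t| - t^2 / (2*gamma) for |t| <= gamma*lam and
    gamma * lam^2 / 2 beyond that point (the penalty is flat there).
    """
    t = np.abs(np.asarray(t, dtype=float))
    inner = lam * t - t ** 2 / (2.0 * gamma)
    outer = 0.5 * gamma * lam ** 2
    return np.where(t <= gamma * lam, inner, outer)


def ar_covariance(p, rho=0.3):
    """Sigma with entries rho^{|i-j|}, i, j = 1, ..., p."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])


def true_coefficients(p, signal=2.0, confounded=True):
    """alpha* = signal * (1_4, -1_4, 0_{p-8}); beta* = signal * (1_2, -1_2, 0_{p-4}) or 0_p."""
    alpha = signal * np.concatenate([np.ones(4), -np.ones(4), np.zeros(p - 8)])
    if confounded:
        beta = signal * np.concatenate([np.ones(2), -np.ones(2), np.zeros(p - 4)])
    else:
        beta = np.zeros(p)
    return alpha, beta
```

Because the MCP is constant beyond \(\gamma \lambda\), large coefficients are not shrunk further, which is the source of its reduced bias relative to the Lasso.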
4.1 Finite-Sample studies
For the proposed method, we use cross-validation and BIC to select the tuning parameters and refer to the resulting estimators as RL.cv and RL.bic, respectively. We also implement the analysis that ignores the unmeasured confounding effect and refer to it as RL.NAI. In addition, we compare the proposed method with the following methods:
Outcome-adjusted method: define the adjusted outcome
Under assumption \({\textbf {(A0)}}\) and \({\textbf {(M0)}}\), \(\mathbb {E}\left( \widetilde{T}^{adjust}|{\textbf {Z}}\right) ={\textbf {X}}^{\textrm{T}}\varvec{\alpha }+(1-S){\textbf {X}}^{\textrm{T}}\varvec{\beta }\). Then we can build the penalized regression model based on this equation (similar to the construction of the proposed method). We use the same penalty function as the proposed method to identify unmeasured confounding effect and adopt CV and BIC to select the tuning parameters. This method is referred to as OA.cv and OA.bic respectively.
AFT model with \(\mu _0\): under assumption \({\textbf {(A0)}}\) and \({\textbf {(M0)}}\), it holds that
Then we can build the AFT model based on this equation. The estimation procedures are the same as the outcome-adjusted method. This method is referred to as GM0.cv and GM0.bic respectively.
AFT model with \(\mu _1\): under assumptions \({\textbf {(A0)}}\) and \({\textbf {(M0)}}\), it holds that
Then we can build the AFT model based on this equation. The estimation procedures are the same as the outcome-adjusted method. This method is referred to as GM1.cv and GM1.bic respectively.
Meta estimates: combine GM0.cv and GM1.cv (or GM0.bic and GM1.bic) with weights proportional to sample size. The resulting estimators are referred to as Meta.cv and Meta.bic, respectively.
AFT model with \(\mu _0\), \(\mu _1\): under assumptions \({\textbf {(A0)}}\) and \({\textbf {(M0)}}\), it holds that
Then we can build the AFT model based on this equation. The following procedures are the same as the outcome-adjusted method. This method is referred to as GM01.cv and GM01.bic respectively.
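The sample-size weighting used for the meta estimates above can be sketched as follows. This is a minimal illustration, assuming that each component estimate (for example GM0.cv and GM1.cv) comes with the number of observations it was fitted on; the function name `sample_size_weighted_meta` and the example numbers are hypothetical and not taken from the paper.

```python
import numpy as np


def sample_size_weighted_meta(estimates, sample_sizes):
    """Combine coefficient vectors with weights proportional to sample size."""
    est = np.asarray(estimates, dtype=float)    # shape (k, p): k estimates of a p-vector
    n = np.asarray(sample_sizes, dtype=float)   # shape (k,): one sample size per estimate
    if est.ndim != 2 or n.shape != (est.shape[0],):
        raise ValueError("need one sample size per estimate")
    weights = n / n.sum()                       # weights sum to one
    return weights @ est                        # weighted average, shape (p,)


# Hypothetical example: two 2-dimensional estimates fitted on 100 and 300 observations.
combined = sample_size_weighted_meta([[1.0, 0.0], [3.0, 2.0]], [100, 300])
```

With these inputs the weights are 0.25 and 0.75, so the combined vector is (2.5, 1.5).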
We also calculate the RCT-only estimates for all these methods, using CV to select the tuning parameters, and refer to them as RL.RCT, OA.RCT, GM0.RCT, GM1.RCT, Meta.RCT, and GM01.RCT, respectively. In addition, assuming that the variables are correctly selected, we calculate the oracle estimates, referred to as RL.or, RL.NAIor, OA.or, GM0.or, GM1.or, Meta.or, GM01.or, RL.RCTor, OA.RCTor, GM0.RCTor, GM1.RCTor, Meta.RCTor, and GM01.RCTor, respectively.
For the estimation of the HTE parameters, we use the root mean squared error (RMSE) to evaluate estimation accuracy and the false discovery rate (FDR) to evaluate variable selection. The definitions are as follows: for simulation replications \(b=1,2,...,B\), \(\text {RMSE}=(\text {MSE})^{1/2}\), where \(\text {MSE}=(Bp)^{-1}\sum _{b=1}^{B}\sum _{j=1}^{p}(\widehat{\alpha }_j^{(b)}-\alpha ^*_j)^2,\)
The empirical results are based on \(B=500\) replications.
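The RMSE criterion defined above can be computed directly from the stacked simulation output. The sketch below assumes the estimates are stored as a \(B\times p\) array, one row per replication; the function name `rmse` and this storage layout are illustrative choices rather than the authors' implementation.

```python
import numpy as np


def rmse(alpha_hat, alpha_star):
    """RMSE = sqrt((B p)^{-1} sum_b sum_j (alpha_hat[b, j] - alpha_star[j])^2)."""
    alpha_hat = np.asarray(alpha_hat, dtype=float)    # shape (B, p), one row per replication
    alpha_star = np.asarray(alpha_star, dtype=float)  # shape (p,), the true coefficients
    n_rep, p = alpha_hat.shape
    mse = np.sum((alpha_hat - alpha_star) ** 2) / (n_rep * p)
    return float(np.sqrt(mse))
```

Because the squared errors are averaged over both replications and coordinates, the many truly zero coefficients in a sparse \(\varvec{\alpha }^*\) pull the reported value down as \(p\) grows, which is worth keeping in mind when comparing columns with different \(p\).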
The simulation results are shown in Tables 1 and 2. We make the following observations. (i) Among the oracle estimators, those that utilize RWD perform better than the RCT-only estimators, and the proposed oracle estimator (RL.or) has the smallest RMSE in all settings. (ii) The estimator that ignores the unmeasured confounding effect in RWD (RL.NAI) has the highest RMSE. This shows that ignoring unmeasured confounding can lead to substantial estimation error, confirming the necessity of identifying the unmeasured confounding effect in RWD. (iii) Regarding the selection of tuning parameters, CV is competitive with BIC when there is no unmeasured confounding effect and better than BIC when there is an unmeasured confounding effect. In the following, we therefore discuss only the results from CV. All estimators that utilize RWD have smaller RMSE than the RCT-only estimators. Among the reported estimators, the RMSE of the proposed estimate (RL.cv) is noticeably lower than that of the other methods. The RMSE of OA.cv is the second lowest, and the RMSE of Meta.cv is slightly higher than that of OA.cv. (iv) The FDR of the proposed estimator (RL.cv) is competitive with that of OA.cv, and it is noticeably lower than that of the other methods. (v) Based on the TIR, the rate of correctly identifying whether an unmeasured confounding effect exists, all methods identify the case with unmeasured confounding well. When no unmeasured confounding effect exists, the proposed and outcome-adjusted methods perform better than the other methods. (vi) The RMSEs and FDRs generally worsen as the censoring rate increases. The estimators that utilize RWD gain more efficiency when there is no unmeasured confounding effect than when an unmeasured confounding effect exists.
Table 1
The RMSE (\(\times 10^{2}\)) of the HTE estimation when \(Signal=2\) over 500 experiment replicates (uc: with unmeasured confounding; nuc: no unmeasured confounding; CR: censoring rate)

| Methods | uc, p=20, CR=20% | uc, p=20, CR=40% | uc, p=50, CR=20% | uc, p=50, CR=40% | nuc, p=20, CR=20% | nuc, p=20, CR=40% | nuc, p=50, CR=20% | nuc, p=50, CR=40% |
|---|---|---|---|---|---|---|---|---|
| RL.or | 8.24 | 10.08 | 5.29 | 6.64 | 5.42 | 6.71 | 3.55 | 4.22 |
| RL.RCTor | 15.16 | 18.15 | 9.22 | 11.07 | 15.20 | 17.94 | 8.96 | 11.09 |
| RL.NAIor | 71.00 | 71.10 | 44.83 | 44.98 | - | - | - | - |
| RL.cv | 8.40 | 10.35 | 5.31 | 6.71 | 5.64 | 7.15 | 3.59 | 4.35 |
| RL.bic | 8.26 | 10.13 | 5.32 | 7.63 | 5.48 | 6.81 | 3.56 | 4.19 |
| RL.RCT | 15.67 | 19.05 | 9.29 | 11.78 | 15.71 | 18.95 | 9.22 | 11.82 |
| RL.NAI | 71.67 | 72.28 | 45.88 | 46.67 | - | - | - | - |
| OA.or | 11.30 | 12.97 | 7.10 | 8.54 | 8.87 | 10.72 | 5.71 | 7.04 |
| OA.RCTor | 22.23 | 25.97 | 15.82 | 18.49 | 22.32 | 25.95 | 15.75 | 18.57 |
| OA.cv | 11.34 | 13.10 | 7.13 | 8.60 | 8.90 | 10.81 | 5.73 | 7.05 |
| OA.bic | 11.33 | 13.03 | 7.13 | 9.12 | 8.89 | 10.75 | 5.72 | 7.04 |
| OA.RCT | 22.36 | 26.20 | 15.89 | 18.65 | 22.44 | 26.20 | 15.84 | 18.73 |
| GM0.or | 13.73 | 15.87 | 8.81 | 10.33 | 10.47 | 12.43 | 6.76 | 8.06 |
| GM0.RCTor | 24.92 | 29.09 | 17.40 | 20.61 | 24.92 | 29.09 | 17.40 | 20.42 |
| GM0.cv | 14.26 | 16.69 | 9.03 | 10.81 | 11.16 | 13.13 | 6.98 | 8.45 |
| GM0.bic | 13.78 | 16.20 | 8.97 | 11.41 | 10.74 | 12.54 | 6.85 | 8.10 |
| GM0.RCT | 24.81 | 29.10 | 17.35 | 21.02 | 24.81 | 29.10 | 17.32 | 20.85 |
| GM1.or | 11.76 | 13.79 | 7.54 | 9.23 | 9.52 | 11.35 | 6.13 | 7.40 |
| GM1.RCTor | 22.67 | 26.66 | 15.84 | 18.36 | 22.62 | 26.62 | 15.93 | 18.36 |
| GM1.cv | 12.35 | 14.68 | 7.81 | 9.86 | 9.99 | 11.98 | 6.34 | 7.77 |
| GM1.bic | 11.93 | 14.02 | 7.62 | 10.16 | 9.70 | 11.50 | 6.22 | 7.46 |
| GM1.RCT | 22.57 | 26.64 | 15.77 | 18.51 | 22.51 | 26.61 | 15.88 | 18.48 |
| Meta.or | 11.14 | 12.86 | 7.16 | 8.49 | 9.32 | 11.12 | 6.02 | 7.24 |
| Meta.RCTor | 22.34 | 25.93 | 15.68 | 18.19 | 22.31 | 25.93 | 15.67 | 18.12 |
| Meta.cv | 11.48 | 13.39 | 7.31 | 8.86 | 9.70 | 11.45 | 6.18 | 7.45 |
| Meta.bic | 11.21 | 13.04 | 7.25 | 9.20 | 9.51 | 11.20 | 6.11 | 7.27 |
| Meta.RCT | 22.17 | 25.77 | 15.59 | 18.35 | 22.13 | 25.77 | 15.58 | 18.28 |
| GM01.or | 13.77 | 17.05 | 8.95 | 11.30 | 11.50 | 14.86 | 7.49 | 9.77 |
| GM01.RCTor | 28.50 | 36.42 | 19.88 | 25.77 | 28.41 | 36.36 | 19.93 | 25.56 |
| GM01.cv | 14.63 | 18.69 | 9.66 | 13.40 | 12.49 | 16.61 | 8.06 | 11.80 |
| GM01.bic | 13.94 | 17.48 | 9.03 | 11.60 | 11.83 | 15.50 | 7.54 | 10.09 |
| GM01.RCT | 28.65 | 36.67 | 19.95 | 26.03 | 28.56 | 36.60 | 19.99 | 25.77 |
Some results are marked in bold to ease comparison between methods. Apart from the oracle estimates, the best result in each setting (i.e., the smallest RMSE) is underlined
In the table, CR represents the censoring rate. Methods with names starting with “RL” are based on the proposed model. RL.cv and RL.bic represent the proposed estimates under the CV and BIC criteria, respectively. RL.RCT represents the estimate based on RCT data only. RL.NAI is the naive estimate, which completely ignores the unmeasured confounding effect. RL.or, RL.RCTor, and RL.NAIor are the oracle estimates of the integrative analysis, RCT-only analysis, and naive analysis, respectively. The other methods, with names starting with “OA”, “GM0”, “GM1”, “Meta” and “GM01”, are introduced in detail in Sect. 4
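Table 2
The average TIR (%) and FDR (%) when \(Signal=2\) over 500 experiment replicates (uc: with unmeasured confounding; nuc: no unmeasured confounding; CR: censoring rate)

| Index | Methods | uc, p=20, CR=20% | uc, p=20, CR=40% | uc, p=50, CR=20% | uc, p=50, CR=40% | nuc, p=20, CR=20% | nuc, p=20, CR=40% | nuc, p=50, CR=20% | nuc, p=50, CR=40% |
|---|---|---|---|---|---|---|---|---|---|
| TIR | RL.cv | 100 | 100 | 100 | 100 | 97.8 | 97.3 | 98.5 | 98.1 |
| TIR | RL.bic | 100 | 100 | 100 | 100 | 97.8 | 98.6 | 99.2 | 98.8 |
| TIR | OA.cv | 100 | 100 | 100 | 100 | 98.8 | 97.8 | 99.2 | 99.0 |
| TIR | OA.bic | 100 | 100 | 100 | 100 | 99.4 | 99.4 | 99.2 | 99.1 |
| TIR | GM0.cv | 100 | 100 | 100 | 100 | 85.0 | 82.0 | 90.8 | 87.2 |
| TIR | GM0.bic | 100 | 100 | 100 | 100 | 95.4 | 95.4 | 97.8 | 97.6 |
| TIR | GM1.cv | 100 | 100 | 100 | 100 | 87.0 | 84.6 | 92.4 | 86.8 |
| TIR | GM1.bic | 100 | 100 | 100 | 100 | 96.6 | 94.2 | 98.4 | 95.2 |
| TIR | Meta.cv | 100 | 100 | 100 | 100 | 74.2 | 70.2 | 85.0 | 75.4 |
| TIR | Meta.bic | 100 | 100 | 100 | 100 | 92.0 | 89.6 | 96.2 | 92.8 |
| TIR | GM01.cv | 100 | 100 | 100 | 100 | 60.2 | 26.8 | 61.6 | 13.4 |
| TIR | GM01.bic | 100 | 100 | 100 | 100 | 88.4 | 67.8 | 92.6 | 64.2 |
| FDR | RL.cv | 0.28 | 0.67 | 0.07 | 0.71 | 0.32 | 0.93 | 0.13 | 0.37 |
| FDR | RL.bic | 0.07 | 0.20 | 0.04 | 0.51 | 0.07 | 0.20 | 0.02 | 0.09 |
| FDR | RL.RCT | 3.51 | 5.00 | 3.48 | 6.16 | 3.39 | 5.07 | 3.63 | 5.81 |
| FDR | RL.NAI | 1.31 | 2.61 | 0.87 | 2.28 | - | - | - | - |
| FDR | OA.cv | 0.04 | 0.11 | 0.04 | 0.18 | 0.09 | 0.37 | 0.04 | 0.07 |
| FDR | OA.bic | 0.04 | 0.04 | 0.02 | 0.11 | 0.02 | 0.09 | 0.00 | 0.00 |
| FDR | OA.RCT | 0.90 | 1.55 | 0.56 | 1.59 | 0.80 | 1.71 | 0.74 | 1.71 |
| FDR | GM0.cv | 1.37 | 2.71 | 1.63 | 3.85 | 0.90 | 1.91 | 1.15 | 3.00 |
| FDR | GM0.bic | 0.18 | 0.77 | 0.40 | 1.34 | 0.24 | 0.34 | 0.16 | 0.24 |
| FDR | GM0.RCT | 1.51 | 2.24 | 0.86 | 1.81 | 1.51 | 2.24 | 1.03 | 1.65 |
| FDR | GM1.cv | 1.14 | 2.26 | 0.92 | 3.35 | 0.68 | 1.49 | 0.84 | 2.51 |
| FDR | GM1.bic | 0.13 | 0.59 | 0.07 | 1.18 | 0.16 | 0.18 | 0.11 | 0.24 |
| FDR | GM1.RCT | 1.21 | 2.55 | 0.46 | 1.85 | 1.29 | 2.66 | 0.68 | 1.44 |
| FDR | Meta.cv | 2.45 | 4.71 | 2.50 | 6.80 | 1.54 | 3.24 | 1.98 | 5.23 |
| FDR | Meta.bic | 0.31 | 1.35 | 0.46 | 2.43 | 0.37 | 0.52 | 0.27 | 0.46 |
| FDR | Meta.RCT | 2.66 | 4.63 | 1.31 | 3.58 | 2.76 | 4.74 | 1.71 | 3.05 |
| FDR | GM01.cv | 1.69 | 7.05 | 3.99 | 19.3 | 1.38 | 5.50 | 2.51 | 16.6 |
| FDR | GM01.bic | 0.16 | 1.02 | 0.28 | 2.24 | 0.18 | 1.08 | 0.13 | 2.24 |
| FDR | GM01.RCT | 1.18 | 2.30 | 0.37 | 1.81 | 1.12 | 2.09 | 0.30 | 1.60 |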
In the table, CR represents the censoring rate. TIR is the rate of correctly identifying whether an unmeasured confounding effect exists. FDR is the false discovery rate of the HTE estimates. Methods with names starting with “RL” are based on the proposed model. RL.cv and RL.bic represent the proposed estimates under the CV and BIC criteria, respectively. RL.RCT represents the estimate based on RCT data only. RL.NAI is the naive estimate, which completely ignores the unmeasured confounding effect. The other methods, with names starting with “OA”, “GM0”, “GM1”, “Meta” and “GM01”, are introduced in detail in Sect. 4
Additional simulation experiments considering a weaker signal strength of the coefficients (\(Signal=1\)), a more severe censoring rate (CR\(=60\%\)), and a log-logistic distribution of the survival time are presented in detail in the supplementary materials. The results show that the proposed method maintains its effectiveness across these settings. To summarize, the proposed method identifies unmeasured confounding effects well and gains efficiency over the RCT-only estimators. The proposed estimator performs well in settings with relatively high dimensions and severe censoring. In addition, it performs best among the reported methods.
4.2 Variance estimation
Simulations are conducted to evaluate the nonparametric bootstrap approach for variance estimation. The details of the estimation method are presented in Remark 5. We compute the variance estimates for the proposed method using two types of data: one combining RCT with RWD (denoted as RCT+RWD), and the other using RCT data alone. Here we use 500 bootstrap samples. In Tables 3 and 4, we report the bias of the point estimates (Bias), the standard deviations of the point estimates (SD), the means of the bootstrap estimated standard deviations (SE), and the coverage proportions of the 95% confidence intervals (CP), based on 500 replications.
Tables 3 and 4 show that the bootstrap standard deviation estimates match the standard deviations of the estimates well. Furthermore, the variance of the estimates derived from the combined RCT+RWD dataset is lower than that obtained from RCT data alone. This variance reduction, or shrinkage, is particularly pronounced for the coefficients in \(\mathcal {D}_1\setminus \mathcal {D}_2\).
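The bootstrap quantities reported in the tables (SE and CP) can be illustrated with a generic sketch. It is not the procedure of Remark 5: it assumes that rows of a single data array are resampled with replacement as i.i.d. units, that any estimator can be plugged in through the hypothetical `estimator` argument, and that coverage is assessed with a normal-based interval built from the bootstrap standard error. All names are illustrative.

```python
import numpy as np


def bootstrap_se(data, estimator, n_boot=500, seed=0):
    """Nonparametric bootstrap standard error of estimator(data)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    reps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        reps.append(estimator(data[idx]))
    # Standard deviation across bootstrap replicates, per coordinate.
    return np.std(np.asarray(reps, dtype=float), axis=0, ddof=1)


def wald_interval_covers(estimate, se, truth, z=1.959963984540054):
    """Check whether estimate +/- z * se contains truth (95% normal interval by default)."""
    return np.abs(np.asarray(estimate) - truth) <= z * np.asarray(se)
```

In a simulation study, SE would be the average of `bootstrap_se` over replications, and CP the fraction of replications for which `wald_interval_covers` returns true. For integrated data, one would presumably resample within the RCT and RWD samples separately, which this sketch does not do.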
Table 3
The inference results of the proposed HTE estimate over 500 experiment replicates when \(p=20\)

| Case | Dataset | Index | \(\alpha ^*_1=-2\) | \(\alpha ^*_2=-2\) | \(\alpha ^*_3=-2\) | \(\alpha ^*_4=-2\) | \(\alpha ^*_5=2\) | \(\alpha ^*_6=2\) | \(\alpha ^*_7=2\) | \(\alpha ^*_8=2\) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | RCT | Bias | 0.090 | 0.098 | 0.098 | 0.083 | -0.079 | -0.102 | -0.095 | -0.084 |
| 1 | RCT | SD | 0.167 | 0.185 | 0.193 | 0.193 | 0.192 | 0.190 | 0.182 | 0.177 |
| 1 | RCT | SE | 0.188 | 0.196 | 0.196 | 0.203 | 0.197 | 0.197 | 0.196 | 0.192 |
| 1 | RCT | CP(95%) | 0.939 | 0.921 | 0.926 | 0.936 | 0.934 | 0.933 | 0.933 | 0.939 |
| 1 | RCT+RWD | Bias | 0.008 | 0.006 | 0.006 | 0.035 | 0.010 | 0.005 | 0.005 | 0.013 |
| 1 | RCT+RWD | SD | 0.080 | 0.085 | 0.083 | 0.094 | 0.086 | 0.087 | 0.083 | 0.079 |
| 1 | RCT+RWD | SE | 0.081 | 0.084 | 0.083 | 0.090 | 0.088 | 0.084 | 0.084 | 0.081 |
| 1 | RCT+RWD | CP(95%) | 0.946 | 0.947 | 0.923 | 0.910 | 0.931 | 0.931 | 0.944 | 0.944 |
| 2 | RCT | Bias | 0.079 | 0.076 | 0.088 | 0.073 | -0.083 | -0.079 | -0.094 | -0.079 |
| 2 | RCT | SD | 0.208 | 0.223 | 0.234 | 0.233 | 0.240 | 0.214 | 0.215 | 0.221 |
| 2 | RCT | SE | 0.230 | 0.239 | 0.239 | 0.247 | 0.239 | 0.241 | 0.239 | 0.232 |
| 2 | RCT | CP(95%) | 0.959 | 0.959 | 0.928 | 0.949 | 0.925 | 0.959 | 0.955 | 0.949 |
| 2 | RCT+RWD | Bias | 0.009 | 0.006 | 0.009 | 0.035 | 0.010 | 0.007 | 0.004 | 0.012 |
| 2 | RCT+RWD | SD | 0.094 | 0.100 | 0.101 | 0.114 | 0.100 | 0.101 | 0.103 | 0.091 |
| 2 | RCT+RWD | SE | 0.097 | 0.099 | 0.101 | 0.110 | 0.111 | 0.100 | 0.100 | 0.099 |
| 2 | RCT+RWD | CP(95%) | 0.955 | 0.951 | 0.940 | 0.904 | 0.947 | 0.932 | 0.957 | 0.947 |
| 3 | RCT | Bias | 0.102 | 0.092 | 0.099 | 0.078 | -0.076 | -0.098 | -0.086 | -0.098 |
| 3 | RCT | SD | 0.172 | 0.184 | 0.194 | 0.197 | 0.187 | 0.190 | 0.185 | 0.182 |
| 3 | RCT | SE | 0.187 | 0.196 | 0.196 | 0.202 | 0.197 | 0.198 | 0.198 | 0.191 |
| 3 | RCT | CP(95%) | 0.933 | 0.931 | 0.927 | 0.931 | 0.944 | 0.929 | 0.944 | 0.929 |
| 3 | RCT+RWD | Bias | 0.040 | 0.021 | 0.030 | -0.015 | 0.004 | 0.006 | 0.007 | 0.011 |
| 3 | RCT+RWD | SD | 0.153 | 0.160 | 0.160 | 0.164 | 0.081 | 0.082 | 0.088 | 0.080 |
| 3 | RCT+RWD | SE | 0.169 | 0.182 | 0.166 | 0.163 | 0.090 | 0.093 | 0.094 | 0.089 |
| 3 | RCT+RWD | CP(95%) | 0.958 | 0.960 | 0.949 | 0.947 | 0.967 | 0.960 | 0.956 | 0.960 |
| 4 | RCT | Bias | 0.091 | 0.081 | 0.087 | 0.069 | -0.084 | -0.080 | -0.093 | -0.078 |
| 4 | RCT | SD | 0.205 | 0.221 | 0.231 | 0.228 | 0.228 | 0.215 | 0.208 | 0.224 |
| 4 | RCT | SE | 0.230 | 0.241 | 0.238 | 0.245 | 0.239 | 0.239 | 0.238 | 0.231 |
| 4 | RCT | CP(95%) | 0.959 | 0.953 | 0.931 | 0.957 | 0.936 | 0.949 | 0.949 | 0.938 |
| 4 | RCT+RWD | Bias | 0.041 | 0.032 | 0.030 | -0.027 | 0.011 | 0.012 | 0.011 | 0.011 |
| 4 | RCT+RWD | SD | 0.188 | 0.185 | 0.188 | 0.187 | 0.103 | 0.104 | 0.104 | 0.100 |
| 4 | RCT+RWD | SE | 0.216 | 0.230 | 0.204 | 0.202 | 0.122 | 0.123 | 0.122 | 0.118 |
| 4 | RCT+RWD | CP(95%) | 0.966 | 0.972 | 0.968 | 0.968 | 0.966 | 0.976 | 0.970 | 0.968 |
Cases 1-4 correspond to (nuc, 20% CR), (nuc, 40% CR), (uc, 20% CR), and (uc, 40% CR), respectively, where nuc means that there is no unmeasured confounding and uc means that there is unmeasured confounding. The SEs are marked in bold to ease comparison between RCT and RCT+RWD
Table 4
The inference results of the proposed HTE estimate over 500 experiment replicates when \(p=50\)

| Case | Dataset | Index | \(\alpha ^*_1=-2\) | \(\alpha ^*_2=-2\) | \(\alpha ^*_3=-2\) | \(\alpha ^*_4=-2\) | \(\alpha ^*_5=2\) | \(\alpha ^*_6=2\) | \(\alpha ^*_7=2\) | \(\alpha ^*_8=2\) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | RCT | Bias | 0.076 | 0.073 | 0.071 | 0.067 | -0.082 | -0.056 | -0.084 | -0.083 |
| 1 | RCT | SD | 0.170 | 0.176 | 0.180 | 0.189 | 0.189 | 0.180 | 0.183 | 0.177 |
| 1 | RCT | SE | 0.180 | 0.189 | 0.189 | 0.196 | 0.188 | 0.189 | 0.188 | 0.182 |
| 1 | RCT | CP(95%) | 0.934 | 0.948 | 0.950 | 0.946 | 0.924 | 0.962 | 0.928 | 0.940 |
| 1 | RCT+RWD | Bias | 0.000 | 0.015 | 0.008 | 0.045 | 0.001 | 0.003 | 0.003 | 0.006 |
| 1 | RCT+RWD | SD | 0.084 | 0.085 | 0.089 | 0.086 | 0.085 | 0.089 | 0.089 | 0.083 |
| 1 | RCT+RWD | SE | 0.079 | 0.082 | 0.082 | 0.086 | 0.085 | 0.082 | 0.082 | 0.082 |
| 1 | RCT+RWD | CP(95%) | 0.942 | 0.930 | 0.930 | 0.903 | 0.938 | 0.928 | 0.930 | 0.946 |
| 2 | RCT | Bias | 0.052 | 0.076 | 0.079 | 0.068 | -0.068 | -0.058 | -0.059 | -0.072 |
| 2 | RCT | SD | 0.201 | 0.220 | 0.217 | 0.224 | 0.218 | 0.216 | 0.229 | 0.221 |
| 2 | RCT | SE | 0.222 | 0.232 | 0.232 | 0.240 | 0.232 | 0.232 | 0.232 | 0.224 |
| 2 | RCT | CP(95%) | 0.948 | 0.952 | 0.954 | 0.958 | 0.947 | 0.958 | 0.948 | 0.947 |
| 2 | RCT+RWD | Bias | 0.000 | 0.011 | 0.011 | 0.042 | 0.006 | 0.006 | 0.008 | 0.015 |
| 2 | RCT+RWD | SD | 0.093 | 0.097 | 0.101 | 0.103 | 0.098 | 0.099 | 0.104 | 0.096 |
| 2 | RCT+RWD | SE | 0.096 | 0.099 | 0.097 | 0.107 | 0.107 | 0.100 | 0.099 | 0.099 |
| 2 | RCT+RWD | CP(95%) | 0.950 | 0.948 | 0.937 | 0.906 | 0.941 | 0.941 | 0.924 | 0.933 |
| 3 | RCT | Bias | 0.064 | 0.092 | 0.069 | 0.085 | -0.072 | -0.070 | -0.084 | -0.077 |
| 3 | RCT | SD | 0.166 | 0.188 | 0.181 | 0.189 | 0.187 | 0.184 | 0.178 | 0.177 |
| 3 | RCT | SE | 0.180 | 0.188 | 0.187 | 0.195 | 0.188 | 0.189 | 0.187 | 0.182 |
| 3 | RCT | CP(95%) | 0.943 | 0.927 | 0.937 | 0.933 | 0.945 | 0.927 | 0.931 | 0.947 |
| 3 | RCT+RWD | Bias | 0.037 | 0.041 | 0.027 | -0.014 | 0.005 | 0.007 | 0.004 | 0.008 |
| 3 | RCT+RWD | SD | 0.171 | 0.163 | 0.161 | 0.157 | 0.085 | 0.088 | 0.094 | 0.085 |
| 3 | RCT+RWD | SE | 0.177 | 0.198 | 0.170 | 0.167 | 0.090 | 0.092 | 0.091 | 0.088 |
| 3 | RCT+RWD | CP(95%) | 0.965 | 0.961 | 0.963 | 0.963 | 0.955 | 0.965 | 0.937 | 0.935 |
| 4 | RCT | Bias | 0.067 | 0.060 | 0.070 | 0.069 | -0.066 | -0.050 | -0.077 | -0.080 |
| 4 | RCT | SD | 0.197 | 0.231 | 0.202 | 0.226 | 0.212 | 0.208 | 0.228 | 0.224 |
| 4 | RCT | SE | 0.223 | 0.235 | 0.233 | 0.240 | 0.232 | 0.232 | 0.231 | 0.223 |
| 4 | RCT | CP(95%) | 0.967 | 0.947 | 0.971 | 0.967 | 0.965 | 0.963 | 0.941 | 0.940 |
| 4 | RCT+RWD | Bias | 0.035 | 0.032 | 0.030 | -0.017 | 0.009 | 0.009 | 0.005 | 0.011 |
| 4 | RCT+RWD | SD | 0.182 | 0.209 | 0.204 | 0.199 | 0.104 | 0.109 | 0.108 | 0.103 |
| 4 | RCT+RWD | SE | 0.237 | 0.271 | 0.217 | 0.206 | 0.115 | 0.116 | 0.117 | 0.116 |
| 4 | RCT+RWD | CP(95%) | 0.974 | 0.976 | 0.943 | 0.958 | 0.960 | 0.943 | 0.949 | 0.958 |
Cases 1-4 correspond to (nuc, 20% CR), (nuc, 40% CR), (uc, 20% CR), and (uc, 40% CR), respectively, where nuc means that there is no unmeasured confounding and uc means that there is unmeasured confounding. The SEs are marked in bold to ease comparison between RCT and RCT+RWD
5 Application
Lung cancer has become the leading cause of cancer-related deaths worldwide, with increasing incidence over the last two decades (Sung et al. 2021). Surgical resection, including lobectomy and sublobar resection, is commonly used for early-stage lung cancer. Lobectomy involves the complete removal of the lung lobe where the tumor is located, while sublobar resection removes only a smaller section of the affected lobe. In 1995, Ginsberg and Rubinstein reported a randomized trial that compared lobectomy with sublobar resection in patients with clinical T1N0 non-small-cell lung cancer (NSCLC) (Ginsberg and Rubinstein 1995). They found that, compared with lobectomy, sublobar resection does not improve perioperative morbidity, mortality, or late postoperative pulmonary function. These results made lobectomy the standard surgical treatment for patients with clinical T1N0 NSCLC, and sublobar resection for early-stage lung cancer has been reserved for patients with poor pulmonary reserve or other major comorbidities contraindicating lobectomy. Over the years, however, advances in imaging and staging methods have allowed the detection of smaller and earlier tumors, leading to renewed interest in sublobar resection for patients with clinical stage IA NSCLC who could otherwise undergo lobectomy (Saji et al. 2022).
Fig. 1
The estimated covariate effects in HTE. Here \(hist\_ade\) indicates a presence of histologic type - adenocarcinoma, \(hist\_squ\) suggests a presence of the histologic type - squamous-cell carcinoma
Fig. 2
The estimated covariate effects in unmeasured confounding. Here \(hist\_ade\) indicates a presence of histologic type - adenocarcinoma, \(hist\_squ\) suggests a presence of the histologic type - squamous-cell carcinoma
C140503 is a multicenter, noninferiority, phase 3 trial in which NSCLC patients with tumor size \(\le\)2 cm were randomly assigned to undergo sublobar resection or lobar resection after intraoperative confirmation of node-negative disease (Altorki et al. 2023). From June 2007 to March 2017, a total of 697 patients were assigned to undergo sublobar resection (340 patients) or lobar resection (357 patients). For disease-free survival, the right-censoring rate is 59.7% in the sublobar resection group and 60.5% in the lobar resection group. The trial concluded that sublobar resection was non-inferior to lobar resection with respect to disease-free survival. In addition, a post hoc analysis of the heterogeneity of treatment effects for disease-free survival across patient subgroups, based on the Cox proportional hazards model, revealed that age and tumor size tended to have a negative and a positive effect on lobar resection, respectively. The National Cancer Database (NCDB) is a clinical oncology database maintained by the American College of Surgeons, and it accounts for 72% of all newly diagnosed lung cancer cases in the United States. The NCDB analysis based on the multivariate Cox proportional hazards model and propensity score-based methods reveals a significant advantage of lobectomy over limited resection, which contradicts the results of C140503. This contradiction may be attributed to hidden confounders in the NCDB database. It has been well documented that surgeons and patients tend to opt for limited resection over lobectomy when the patient’s health status is poor, functional respiratory reserve is low, and there is a high burden of comorbidities (Zhang et al. 2019; Lee and Altorki 2023). Unfortunately, these hidden confounders were not captured in the NCDB database, which could result in biased estimates of treatment effects.
Although NCDB provides abundant samples, it fails to give a valid estimate of the causal effect because of unmeasured confounding. We apply the proposed method to integrate the NCDB data with C140503 and examine whether the efficiency of the HTE estimate can be improved. We randomly selected a cohort of 3000 patients with stage IA NSCLC from the NCDB database, ensuring that their tumor size was \(\le\)2 cm and that they met all the eligibility criteria for C140503. We consider the covariates that appear in both C140503 and NCDB, including race (white and other), sex (male and female), age, tumor size, and histologic type (squamous-cell carcinoma, adenocarcinoma, and other). The estimated covariate effects in HTE and unmeasured confounding are presented in Figs. 1 and 2, respectively. Figure 2 shows that an unmeasured confounding effect exists, which is consistent with the previous findings. It also reveals that the hidden confounding is significantly related to the patient’s age at the 90% confidence level. Figure 1 shows that, compared with the C140503-only method, the proposed integrative estimator yields shorter confidence intervals. In particular, the estimated effects of sex (\(\alpha _{sex}\)) and presence of histologic type adenocarcinoma (\(\alpha _{ade}\)) are shrunk to zero when integrating NCDB. In the C140503-only analysis, the upper end of the 90% confidence interval of \(\widehat{\alpha }_{sex}^{rct}\) is close to zero and \(\widehat{\alpha }_{ade}^{rct}\) is not significant, so this shrinkage is reasonable when synthesizing NCDB. These results indicate that integrating the NCDB data with C140503 does improve the efficiency, which demonstrates the practical effectiveness of the proposed method.
6 Conclusion
In this paper, we have developed an integrative method that gives an improved estimate of the HTE by synthesizing the evidence from RCTs and RWD, particularly in situations where the outcome of interest is subject to censoring and the number of covariates is diverging. These situations are more complex and realistic, and they bring more challenges. The proposed method can deal with cases where unmeasured confounding is present in RWD. It can identify whether the unmeasured confounding effect exists in a fully data-driven manner, contributing to more efficient estimates and a deeper understanding of the data generation mechanism. We have rigorously established the theoretical properties, showing that the proposed integrative method yields an HTE estimate that is at least as efficient as the one based on the RCT data alone. The proposed method is also practically applicable. Based on the evidence from C140503, the randomized controlled trial, and the NCDB database, the real-world data that might be subject to hidden confounding, we have applied the proposed method to improve the estimate of the HTE on survival for patients with clinical T1N0 NSCLC undergoing lobar resection. The results show that integrating NCDB data into C140503 enhances the HTE estimation, indicating the practicality of the proposed method.
In this project, we focus on developing data integration methods utilizing Stute weights, given their widespread use and suitability under the censoring assumptions. However, we acknowledge the potential benefits of exploring more general doubly robust weighting approaches. Future work could extend our methods to incorporate IPCW and doubly robust techniques, potentially building on the frameworks established by Lee et al. (2022) and Lee et al. (2024), initially designed for trial generalization. Moreover, in the context of integrating RWD into RCTs, there are still many problems to be solved. For example, it is common to see that RCTs and RWD have different covariates. Merely taking into account the shared covariates may incur other problems. For instance, some critical covariates to describe heterogeneity in treatment may be excluded. Thus, it is important to develop an integrative approach that can deal with non-uniform covariates in RCTs and RWD. Moreover, RCTs with time-varying treatments are common. Integrative analysis of the continuous-time structural failure time model (Yang et al. 2020a) combining the complementary features of RCT and RWD will be an important topic for future research.
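As background for the Stute weights mentioned above, these are the jumps of the Kaplan-Meier estimator at the ordered observed times: with ordered observations \(Y_{(1)}\le \dots \le Y_{(n)}\) and censoring indicators \(\delta _{(i)}\), the weights are \(w_{(1)}=\delta _{(1)}/n\) and \(w_{(i)}=\frac{\delta _{(i)}}{n-i+1}\prod _{j=1}^{i-1}\big (\frac{n-j}{n-j+1}\big )^{\delta _{(j)}}\). The sketch below computes them. The function name `stute_weights` and the tie-breaking rule (events before censored observations at tied times) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def stute_weights(time, event):
    """Kaplan-Meier (Stute) weights, returned in the original order of the observations.

    time  : observed times, i.e. min(T, C)
    event : 1 if the event was observed, 0 if censored
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    n = time.shape[0]
    # Sort by time; at tied times, events come before censored observations.
    order = np.lexsort((-event, time))
    d = event[order]
    ranks = np.arange(1, n + 1)
    # factor_j = ((n - j) / (n - j + 1)) ** delta_(j)
    factors = ((n - ranks) / (n - ranks + 1.0)) ** d
    # Product over j < i, with the empty product equal to 1 for i = 1.
    prod_before = np.concatenate(([1.0], np.cumprod(factors)[:-1]))
    w_sorted = d / (n - ranks + 1.0) * prod_before
    weights = np.empty(n)
    weights[order] = w_sorted
    return weights
```

Without censoring every observation receives weight \(1/n\); censored observations receive weight zero and their mass is redistributed to later observed events.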
Acknowledgements
We are grateful to the referees for their insightful comments and constructive suggestions, which have significantly enhanced the quality of the paper. This research is supported by National Institute on Aging (1R01AG066883), Directorate for Social, Behavioral and Economic Sciences (SES 2242776), and National Natural Science Foundation of China (No. 12271459).
Declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.