
Open Access 29 April 2025

Integrative analysis of high-dimensional RCT and RWD subject to censoring and hidden confounding

Authors: Xin Ye, Shu Yang, Xiaofei Wang, Yanyan Liu

Published in: Lifetime Data Analysis


Abstract

The article examines the integration of high-dimensional randomized controlled trials (RCTs) and real-world data (RWD) to improve the estimation of heterogeneous treatment effects (HTE) in settings that involve censoring and hidden confounding. It introduces an innovative integrative regression approach that simultaneously estimates parameters, selects important variables, and identifies unmeasured confounding effects. The method draws on the strengths of both RCTs and RWD and addresses the limitations of traditional approaches, which often struggle with high-dimensional data and censoring. Through rigorous theoretical analysis and practical applications, the article demonstrates the superior efficiency and accuracy of the proposed method in estimating HTE, particularly in complex and realistic medical research settings. The study underscores the potential of integrating different data sources to improve the precision of treatment effect estimation and to pave the way for more personalized and effective medical interventions. The application of the proposed method to real data from a lung cancer study highlights its practical relevance and its potential impact on clinical decision-making.
Notes

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1007/s10985-025-09654-1.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Recently, there has been a growing focus on the heterogeneity of treatment effect (HTE), a vital path towards personalized medicine (Hamburg and Collins 2010; Collins and Varmus 2015). Accommodating confounding effects is crucial for obtaining well-estimated HTE. In such comparative medical research, it is important but challenging to fully determine what causes confounding effects and measure all of them. The most common approach is to conduct randomized controlled trials (RCTs). RCTs are known as the gold standard for assessing the causal effect of an intervention or treatment on the outcome of interest. The randomization allows the distribution of covariates in different groups to be balanced. However, RCTs have major downsides. For instance, they are costly and time-consuming, and often an inadequate sample size may result from recruitment challenges.
On the other hand, real-world data (RWD), including electronic health records and disease registries, are increasingly available for research purposes and cover a broader and more diverse population than RCTs. RWD provides abundant additional evidence to support HTE estimation. Under the assumption that the records in RWD contain all the confounders, many approaches to harmonizing evidence from RCTs and RWD for HTE estimation have been developed, ranging from classic methods such as regression-based methods and inverse probability weighting to more recent machine learning models like neural networks (Shalit et al. 2017) and random forests (Wager and Athey 2018). Inspired by the Robinson transformation (Robinson 1988), Nie and Wager (2021) recently proposed an R-learner to estimate HTE. The R-learner possesses the property of Neyman orthogonality (Neyman 1959), enabling the integration of more extensive and flexible machine-learning methods for estimating the nuisance functions. However, in uncontrolled real-world settings it is always possible that important confounders are overlooked or unmeasured. For instance, doctors may assign treatment based on patients' symptoms that are not documented in the medical chart. Unmeasured confounding can render the causal effects of interest unidentifiable and distort estimates of HTE.
Classical approaches, such as instrumental variable methods (Angrist et al. 1996), negative controls (Kuroki and Pearl 2014), and sensitivity analysis (Robins et al. 1999), have been proposed to address biases caused by hidden confounding. In recent years, a promising strategy to overcome the challenges posed by hidden confounding is to characterize the confounding function in RWD, and then utilize RCTs to identify both the HTE and confounding function. Drawing upon this idea, Kallus et al. (2018) proposed a regression-based method to estimate HTE. Yang et al. (2020b) established the semiparametric efficient score function to estimate the HTE and confounding function and demonstrated that their method can not only address issues arising from hidden confounding but also enhance the efficiency of HTE estimates. Additionally, they introduced a testing procedure to ascertain the presence of unmeasured confounding, which informs the decision on whether to integrate RWD for a joint analysis (Yang et al. 2023). However, once unmeasured confounding is detected, their approach discards all RWD data. More recently, Wu and Yang (2022) leveraged the benefits of the R-learner to develop an integrative method for estimating the HTE and confounding function, utilizing experimental data for model identification and observational data for efficiency boosting.
However, the approaches mentioned above are all limited to fully observed data and low-dimensional covariates. With the ongoing advancements in data acquisition technology and cloud storage, there is a growing trend towards the collection of high-dimensional data. Censoring frequently occurs in various fields, especially for survival data, where the exact time of the event of interest cannot be observed due to the limited duration of the study. Literature on estimating HTE from high-dimensional or censored data typically assumes the ideal case with no hidden confounding. Ma and Zhou (2018) proposed characterizing the hazard ratio to mimic the heterogeneous treatment effect, yet they did not take into account high-dimensional covariates. Zhu and Gallego (2020) used the difference in survival functions to describe heterogeneous treatment effects. Hu et al. (2021) utilized the difference in survival quantile to characterize the survival treatment effect at the individual level and adopted a machine learning approach for model estimation. Zhou and Zhu (2021) applied the sufficient dimension reduction technique to high-dimensional data without censoring. To our knowledge, the literature on estimating HTE from high-dimensional censored data while offsetting the unmeasured confounding effect remains scarce.
In this paper, we focus on improving the estimate of HTE for a survival outcome by integrating high-dimensional censored RCT and RWD, particularly in situations where unmeasured confounding may exist. We propose an integrative regression approach to simultaneously estimate parameters, select important variables, and determine the presence of unmeasured confounding effects. The proposed method assumes the transportability of the HTE, so the RCT can be used to identify the HTE in the RWD. Both the HTE and the confounding function can be estimated through regularized weighted least squares regression, which accommodates censoring. The proposed method possesses the property of Neyman orthogonality, making it possible to adopt flexible machine-learning methods for estimating the nuisance functions. Theoretical properties are rigorously established, including estimation consistency, variable selection consistency, and asymptotic normality. We demonstrate that the proposed integrative method yields an HTE estimate that is at least as efficient as the estimate based solely on RCT data. When there is unmeasured confounding, instead of excluding all RWD, the proposed method can still make use of the RWD in some cases. This study has the potential to enhance the existing literature in several important respects. First, we conduct an integrative analysis that includes high-dimensional censored RWD in HTE estimation, which can be more challenging than analyzing low-dimensional, completely observed data. Second, the proposed approach permits the presence of unmeasured confounding, which is more flexible and complements analyses that assume unconfoundedness in the RWD. Third, the proposed approach can identify whether an unmeasured confounding effect exists in a fully data-driven manner. This can contribute to more accurate estimates and lead to a deeper understanding of the data-generating mechanism.
Lastly, and equally importantly, this study offers a valuable practical tool for addressing a wide range of scientific issues. In particular, we apply the proposed integrative approach to improve the estimate of HTE on overall survival for patients with early-stage non-small-cell lung cancer undergoing lobar resection and limited resection, which convincingly demonstrates the usefulness of the proposed method.
The remaining part of the paper is organized as follows. In Sect. 2, we introduce the proposed method. Theoretical properties are provided in Sect. 3. Numerical studies are conducted in Sect. 4, and application to real data is presented in Sect. 5. Concluding remarks are given in Sect. 6. Technical details are given in the Appendix.

2 Methods

Let \(\widetilde{T}\) be the failure time, \(C\) the censoring time, \(T=\min (\widetilde{T},C)\) the observed time with censoring indicator \(\delta =I(\widetilde{T}\le C)\), and \(A\in \left\{ 0,1\right\}\) the binary treatment variable. Let \({\textbf {X}}\in \mathbb {R}^p\) be the covariate vector, which includes the intercept term \(X_0\equiv 1\). Let \(S\) denote the data source, taking the value 0 for RWD and 1 for RCT. The sample sizes of the RCT and the RWD are \(n_1\) and \(n_0\), respectively. Let the observed data be \(\mathcal{O}\mathcal{B}=\left\{ \mathcal{O}\mathcal{B}_i,i=1,2,...,n=n_1+n_0\right\}\), where \(\mathcal{O}\mathcal{B}_i=(T_i,\delta _i,{\textbf {X}}_i,A_i,S_i)\). Under the potential outcome framework, let \(\widetilde{T}(a)\), \(C(a)\) and \(T(a)=\min \left\{ \widetilde{T}(a),C(a)\right\}\) denote the potential failure time, potential censoring time and potential observed time under treatment \(a \in \left\{ 0,1\right\}\), respectively. We aim to evaluate the heterogeneous treatment effect (HTE), defined as
$$\begin{aligned} \tau ({\textbf {X}})=\mathbb {E}\left\{ \log \left( \widetilde{T}(1)\right) -\log \left( \widetilde{T}(0)\right) |{\textbf {X}}\right\} . \end{aligned}$$
The definition of the HTE aligns seamlessly with conventional survival models, as illustrated, e.g., in (1) and (2). The basic assumptions for modelling are as follows:
\({\textbf {(A0)}}\)
(i) \(\widetilde{T}=A\widetilde{T}(1)+(1-A)\widetilde{T}(0)\), \(C=AC(1)+(1-A)C(0)\), and \(T=AT(1)+(1-A)T(0)\).
(ii) \(\widetilde{T}(a)\perp A|({\textbf {X}},S=1)\), \(a\in \left\{ 0,1\right\}\).
(iii) \(\mathbb {E}\left\{ \log \left( \widetilde{T}(1)\right) -\log \left( \widetilde{T}(0)\right) |{\textbf {X}}\right\} =\mathbb {E}\left\{ \log \left( \widetilde{T}(1)\right) -\log \left( \widetilde{T}(0)\right) |{\textbf {X}},S\right\}\).
Remark 1
\({\textbf {(A0)}}\)(i) is the consistency assumption linking the observed and potential outcomes.
\({\textbf {(A0)}}\)(ii) holds for the RCT by design. \({\textbf {(A0)}}\)(iii) states that the HTE is the same for the trial participants and the patient population at large. It holds if trial participants are randomly recruited within each subgroup of \({\textbf {X}}\), or if the exclusion criteria for trial participation do not affect the treatment response.
Define \(\mu _a({\textbf {X}},S)=\mathbb {E}\left\{ \log (\widetilde{T})|A=a,S,{\textbf {X}}\right\}\), \(a=0,1\). By assumption, it can be seen that for RCT, \(\mu _1({\textbf {X}},S=1)-\mu _0({\textbf {X}},S=1)=\tau ({\textbf {X}})\). However, this equation may not hold in RWD if unmeasured confounding exists. Define the confounding function \(u_c({\textbf {X}})=\mu _1({\textbf {X}},S=0)-\mu _0({\textbf {X}},S=0)-\tau ({\textbf {X}})\). It can be seen that \(u_c({\textbf {X}})\) captures the unmeasured confounding effect. The above formulations can be summarized into
$$\begin{aligned} \mu _1({\textbf {X}},S)-\mu _0({\textbf {X}},S)=\tau ({\textbf {X}})+(1-S)u_c({\textbf {X}}). \end{aligned}$$
By this formulation, we assume the following model on the failure time
$$\begin{aligned} \log \widetilde{T}=\mu _0({\textbf {X}},S) +A\tau ({\textbf {X}})+A(1-S)u_c({\textbf {X}})+\epsilon , \end{aligned}\tag{1}$$
where \(\mathbb {E}(\epsilon |{\textbf {X}},A,S)=0\), and \(\mathbb {E}(\epsilon ^2|{\textbf {X}},A,S)\) is finite. Taking the expectation of both sides of this model conditional on \(({\textbf {X}},S)\) leads to
$$\begin{aligned} \mathbb {E}\left\{ \log (\widetilde{T})|S,{\textbf {X}}\right\} =\mu _0({\textbf {X}},S) +e({\textbf {X}},S)\tau ({\textbf {X}})+e({\textbf {X}},S)(1-S)u_c({\textbf {X}}), \end{aligned}\tag{2}$$
where \(e({\textbf {X}},S)=\mathbb {E}(A|{\textbf {X}},S)\) is the propensity score. Subtracting (2) from (1) and rearranging gives
$$\begin{aligned} \log (\widetilde{T})=\mu ({\textbf {X}},S) +\left\{ A-e({\textbf {X}},S)\right\} \tau ({\textbf {X}})+\left\{ A-e({\textbf {X}},S)\right\} (1-S)u_c({\textbf {X}})+\epsilon , \end{aligned}\tag{3}$$
where \(\mu ({\textbf {X}},S)=\mathbb {E}\left\{ \log (\widetilde{T})|S,{\textbf {X}}\right\}\), \(\mathbb {E}(\epsilon |{\textbf {X}},A,S)=0\), and \(\mathbb {E}(\epsilon ^2|{\textbf {X}},A,S)<\infty\). Based on assumption \({\textbf {(A0)}}\), the resulting formulation (3) is an accelerated failure time (AFT) model. The AFT model is a natural choice for clinical decision-making because it has an intuitive regression interpretation in terms of the failure time. There is a rich literature on the AFT model for observational studies (Henderson et al. 2020; Hu et al. 2021; Simoneau et al. 2020; Yang et al. 2020a). Estimation of the AFT model with an unspecified error distribution has been studied extensively. Here, we adopt the weighted least squares (LS) approach (Stute 1993), which is computationally more feasible.
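To make the role of the confounding function \(u_c\) concrete, the following toy simulation may help. It is only an illustrative sketch and not part of the proposed method: there are no covariates beyond the intercept, no censoring, and the hidden variable `h`, the function name `naive_contrast` and all numerical settings are assumptions made for this example. The treated-minus-control contrast of the mean log failure time recovers \(\tau\) under randomization, whereas in the observational sample it equals \(\tau +u_c\).

```python
import math
import random


def naive_contrast(n, randomized, tau=0.5, seed=0):
    """Treated-minus-control difference in mean log failure time.

    The hidden variable h affects the outcome and, in the observational
    (non-randomized) sample, also the treatment assignment.
    """
    rng = random.Random(seed)
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for _ in range(n):
        h = rng.gauss(0.0, 1.0)  # hidden confounder, never recorded
        # RCT: fixed propensity 0.5; RWD: propensity depends on h.
        p_treat = 0.5 if randomized else 1.0 / (1.0 + math.exp(-h))
        a = 1 if rng.random() < p_treat else 0
        log_t = 1.0 + tau * a + h + rng.gauss(0.0, 0.5)
        sums[a] += log_t
        counts[a] += 1
    return sums[1] / counts[1] - sums[0] / counts[0]


rct_gap = naive_contrast(20000, randomized=True)    # estimates tau
rwd_gap = naive_contrast(20000, randomized=False)   # estimates tau + u_c
print(f"RCT contrast: {rct_gap:.2f}, RWD contrast: {rwd_gap:.2f}")
```

In this setup the RCT contrast stays close to the true value 0.5, while the RWD contrast is visibly inflated because patients with larger `h` are both more likely to be treated and have longer failure times.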
Remark 2
More generally, instead of a logarithmic transformation on failure time, any other known monotone transformation can be considered. Then, the definition of HTE and assumption \({\textbf {(A0)}}\) should be correspondingly modified.
In (3), we aim to estimate the HTE \(\tau\) and the confounding function \(u_c\), with \(e\) and \(\mu\) being the nuisance functions. First, we make the following assumptions for modelling heterogeneous treatment effects and unmeasured confounding effects.
\({\textbf {(M0)}}\)
\(u_c({\textbf {X}})\) can be modelled by \({\textbf {X}}^{\textrm{T}}\varvec{\beta }\), and \(\tau ({\textbf {X}})\) can be modelled by \({\textbf {X}}^{\textrm{T}}\varvec{\alpha }\), where \(\varvec{\beta }\), \(\varvec{\alpha }\in \mathbb {R}^p\).
Define the parameters of interest as \(\varvec{\theta }=(\varvec{\alpha }^{\textrm{T}},\varvec{\beta }^{\textrm{T}})^{\textrm{T}}\) and the nuisance functions as \(\eta =(e,\mu )\). Let \({\textbf {Z}}=({\textbf {X}},S)\) and \({\textbf {U}}=({\textbf {X}},(1-S){\textbf {X}})\). Then the weighted loss function is
$$\begin{aligned} \ell (\varvec{\theta },\eta |\mathcal{O}\mathcal{B})=\sum _{i=1}^nw_{i}\left[ \log (T_{(i)})-\mu ({\textbf {Z}}_{(i)})-\left\{ A_{(i)}-e({\textbf {Z}}_{(i)})\right\} {\textbf {U}}_{(i)}^{\textrm{T}}\varvec{\theta }\right] ^2, \end{aligned}\tag{4}$$
where \(T_{(1)}\le T_{(2)}\le ...\le T_{(n)}\) are the ordered observed times, \(\delta _{(i)}\), \(A_{(i)}\), \({\textbf {Z}}_{(i)}\) and \({\textbf {U}}_{(i)}\) are ordered correspondingly, and the weights \(w_{i}\) are defined as follows
$$\begin{aligned} w_{1}=\frac{\delta _{(1)}}{n},\quad w_{i}=\frac{\delta _{(i)}}{n-i+1}\prod _{j=1}^{i-1}\left( \frac{n-j}{n-j+1}\right) ^{\delta _{(j)}},\quad i=2,3,...,n. \end{aligned}$$
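The weight formula above can be computed in a single pass over the sorted observations. The following is a minimal sketch in plain Python; the function name `stute_weights` and the tie-breaking rule (events placed before censored observations at equal times) are assumptions made for this illustration and are not specified in the text.

```python
def stute_weights(times, deltas):
    """Weights w_i from the display above, returned in sorted-time order.

    times  : observed times T_i = min(failure time, censoring time)
    deltas : censoring indicators (1 = event observed, 0 = censored)
    Returns (weights, order), where order[i-1] is the original index of
    the observation with the i-th smallest observed time.
    """
    n = len(times)
    # Sort by observed time; ties put events before censored observations
    # (an assumed convention, not taken from the paper).
    order = sorted(range(n), key=lambda k: (times[k], -deltas[k]))
    weights = []
    # Running product over j < i of ((n - j) / (n - j + 1)) ** delta_(j).
    running = 1.0
    for i, k in enumerate(order, start=1):
        weights.append(deltas[k] / (n - i + 1) * running)
        running *= ((n - i) / (n - i + 1)) ** deltas[k]
    return weights, order


w, order = stute_weights([2.0, 1.0, 3.0], [1, 0, 1])
print(order, w)
```

For the three observations in the usage line, the censored observation at time 1 receives weight 0 and the two events share the remaining mass, each with weight 1/2. When no observation is censored, every weight reduces to \(1/n\), so (4) becomes an ordinary least squares criterion.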
Suppose that the nuisance function \(\eta\) can be pre-estimated by \(\widehat{\eta }\). We then propose the following penalized loss function, which can simultaneously select important variables and determine whether unmeasured confounding exists:
$$\begin{aligned} \ell _{\lambda _1,\lambda _2}(\varvec{\theta },\widehat{\eta }|\mathcal{O}\mathcal{B})=\ell (\varvec{\theta },\widehat{\eta }|\mathcal{O}\mathcal{B}) +\sum _{j=1}^{p}\rho (\vert \theta _j\vert ;\lambda _1)+\sum _{j=p+1}^{2p}\rho (\vert \theta _j\vert ;\lambda _2), \end{aligned}\tag{5}$$
where \(\rho (t;\lambda )\) is a penalty function with tuning parameter \(\lambda >0\) that induces sparsity. Various penalty functions can be used to derive sparse and unbiased estimates, such as the adaptive Lasso (Zou 2006), SCAD (Fan and Peng 2004), and MCP (Zhang 2010). The penalty in (5) consists of two parts, with tuning parameters \(\lambda _1\) and \(\lambda _2\), respectively. The first part corresponds to the HTE parameter \(\varvec{\alpha }\), and the second part corresponds to the unmeasured confounding parameter \(\varvec{\beta }\). Using separate penalties lets the method accommodate the more general case in which the coefficients of the HTE and of the confounding function differ in size. A similar strategy can be found in Cheng et al. (2023). The final estimate can be written as
$$\begin{aligned} \widehat{\varvec{\theta }}=\arg \min _{\varvec{\theta }}\ell _{\lambda _1,\lambda _2}(\varvec{\theta },\widehat{\eta }|\mathcal{O}\mathcal{B}), \end{aligned}\tag{6}$$
where the tuning parameters \(\lambda _1\) and \(\lambda _2\) can be selected by criteria such as AIC, BIC, and cross-validation (CV).
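The minimization in (6) can be sketched with coordinate descent. The example below is a simplified illustration rather than the authors' implementation: it swaps in the plain Lasso penalty \(\rho (t;\lambda )=\lambda t\) instead of the adaptive Lasso, SCAD or MCP, treats the nuisance functions as known, and uses uncensored toy data so that all weights equal \(1/n\). The names `soft_threshold` and `fit_integrative_lasso`, the true coefficients and the tuning values are assumptions made for this example.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def fit_integrative_lasso(y_resid, design, weights, p, lam1, lam2, n_iter=100):
    """Coordinate descent for
        sum_i w_i (y_i - d_i' theta)^2 + lam1 * |alpha|_1 + lam2 * |beta|_1,
    where theta = (alpha, beta), y_i plays the role of log T_(i) - mu_hat(Z_(i))
    and d_i the role of {A_(i) - e_hat(Z_(i))} U_(i).
    """
    n, dim = len(y_resid), 2 * p
    theta = [0.0] * dim
    resid = list(y_resid)  # current residual y - D theta (theta starts at 0)
    col_norm = [sum(weights[i] * design[i][j] ** 2 for i in range(n))
                for j in range(dim)]
    for _ in range(n_iter):
        for j in range(dim):
            if col_norm[j] == 0.0:
                continue
            lam = lam1 if j < p else lam2  # separate penalties for alpha and beta
            # Weighted inner product of column j with the partial residual.
            z = sum(weights[i] * design[i][j] * (resid[i] + design[i][j] * theta[j])
                    for i in range(n))
            new = soft_threshold(z, lam / 2.0) / col_norm[j]
            step = new - theta[j]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= design[i][j] * step
                theta[j] = new
    return theta


# Toy data generated directly from the centred model (3), without censoring.
rng = random.Random(1)
n, p = 2000, 2
alpha_true, beta_true = [1.0, 0.0], [0.8, 0.0]  # assumed values for illustration
y_resid, design = [], []
for _ in range(n):
    x = [1.0, rng.gauss(0.0, 1.0)]        # intercept plus one covariate
    s = 1 if rng.random() < 0.5 else 0    # 1 = RCT, 0 = RWD
    a = 1 if rng.random() < 0.5 else 0    # propensity e = 0.5, treated as known
    u = x + [(1 - s) * v for v in x]      # U = (X, (1 - S) X)
    row = [(a - 0.5) * v for v in u]
    signal = sum(r * t for r, t in zip(row, alpha_true + beta_true))
    y_resid.append(signal + rng.gauss(0.0, 0.5))
    design.append(row)
weights = [1.0 / n] * n                   # all weights equal 1/n without censoring
theta_hat = fit_integrative_lasso(y_resid, design, weights, p, lam1=0.04, lam2=0.04)
print([round(v, 2) for v in theta_hat])
```

A nonzero estimate in the second half of `theta_hat` signals unmeasured confounding in the RWD, and entries shrunk to zero are dropped from the model. Because the Lasso penalty biases large coefficients towards zero, a concave penalty such as SCAD or MCP would be the closer match to the conditions stated in the text.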

3 Theoretical properties

Denote the true parameters by \(\varvec{\theta }^*=(\varvec{\alpha }^{*{\textrm{T}}},\varvec{\beta }^{*{\textrm{T}}})^{\textrm{T}}\) and the true nuisance functions by \(\eta ^*\). Define the index sets of non-zero parameters as follows: \(\mathcal {D}=\left\{ 1\le j\le 2p|\theta ^*_j\ne 0\right\}\) with cardinality \(d_{n}\), \(\mathcal {D}_1=\left\{ 1\le j\le p|\alpha ^*_j\ne 0\right\}\) with cardinality \(d_{1n}\), and \(\mathcal {D}_2=\left\{ 1\le j\le p|\beta ^*_j\ne 0\right\}\) with cardinality \(d_{2n}\). Following the notation in Stute (1996), let \(G\) be the probability distribution function (p.d.f.) of \(C\), with \(\tau _{G}=\inf \left\{ x:G(x)=1\right\}\), \(F\) be the p.d.f. of \(\widetilde{T}\), with \(\tau _{F}=\inf \left\{ x:F(x)=1\right\}\), and \(H\) be the p.d.f. of \(T\), with \(\tau _H=\inf \left\{ x:H(x)=1\right\}\). Let \(F^0\in \mathcal {P}\) be the p.d.f. of \((\widetilde{{\textbf {Z}}},\widetilde{T})\), where \(\widetilde{{\textbf {Z}}}=({\textbf {Z}},A)\). Define
$$\begin{aligned} \widetilde{F}^0({\textbf {z}},t)=\left\{ \begin{array}{lc} F^0({\textbf {z}},t), & t<\tau _H,\\ F^0({\textbf {z}},\tau _H-)+F^0({\textbf {z}},\tau _H)I(\tau _H\in \mathcal {H}),& t\ge \tau _H,\\ \end{array} \right. \end{aligned}$$
with \(\mathcal {H}\) denoting the set of atoms of \(H\), possibly empty. Define the score function
$$\begin{aligned} \phi _j(\widetilde{{\textbf {Z}}},\widetilde{T};\varvec{\theta }^*,\eta ^*)=\left\{ A-e^*({\textbf {Z}})\right\} U_j\left[ \log (\widetilde{T})-\mu ^*({\textbf {Z}})-\left\{ A-e^*({\textbf {Z}})\right\} {\textbf {U}}^{\textrm{T}}\varvec{\theta }^*\right] , \end{aligned}$$
where \(j=1,2,...,2p\). It can be seen that \(\mathbb {E}\varvec{\phi }(\widetilde{{\textbf {Z}}},\widetilde{T};\varvec{\theta }^*,\eta ^*)={\textbf {0}}\), where \(\varvec{\phi }=\left( \phi _j,j=1,2,...,2p\right)\). Define \(\gamma _0(y)=\exp \left\{ \int ^{y-}_0\left\{ 1-H(v)\right\} ^{-1}\widetilde{H}^0(dv)\right\}\), where \(\widetilde{H}^{0}(y)=\Pr (T\le y, \delta =0)\).
(B0)
(i) \(\Pr (\widetilde{T}\le C|{\textbf {X}},S,A,\widetilde{T})=\Pr (\widetilde{T}\le C|\widetilde{T})\).
(ii) The p.d.f.s \(F\) and \(G\) have no jumps in common, and \(\tau _F<\tau _G\).
(iii) \(\mathbb {E}\left\{ \phi _j(\widetilde{{\textbf {Z}}},T;\varvec{\theta }^*,\eta ^*)\gamma _0(T)\delta \right\} ^2<\infty\).
(iv) Let \(g(y)=\int _0^{y-}\left\{ 1-H(w)\right\} ^{-1}\left\{ 1-G(w)\right\} ^{-1}G(dw)\). It holds that
$$\begin{aligned} \int |\phi _j({\textbf {z}},w;\varvec{\theta }^*,\eta ^*)|\sqrt{g(w)}\widetilde{F}^0(d{\textbf {z}},dw)<\infty . \end{aligned}$$
(B1)
(i) The eigenvalues of \(\mathbb {E}\left[ \left\{ A-e^*({\textbf {Z}})\right\} ^2{\textbf {U}}{\textbf {U}}^{\textrm{T}}\right]\) are larger than a positive constant \(c_1\).
(ii) The eigenvalues of \(\mathbb {E}{\textbf {U}}{\textbf {U}}^{\textrm{T}}\) are smaller than a positive constant \(c_2\).
(B2)
The penalty function satisfies the following properties.
(i) \(\rho (x;\lambda )\) is nondecreasing in \(x\in [0,\infty )\) and \(\rho (0;\lambda )=0\).
(ii) Let \(\dot{\rho }(x;\lambda )=\partial \rho (x;\lambda )/\partial x\). It exists and is bounded in \(x\in (0,\infty )\). In addition, \(\dot{\rho }(x;\lambda )/\lambda >0\), as \(x\rightarrow 0+\), \(n\rightarrow \infty\), and \(|\dot{\rho }(x_1;\lambda )-\dot{\rho }(x_2;\lambda )|\le O(1)\lambda |x_1-x_2|\), for \(x_1,x_2\in (0,\infty )\).
(iii) Let \(\ddot{\rho }(x;\lambda )=\partial ^2\rho (x;\lambda )/\partial x^2\). It exists and is bounded in \(x\in (\gamma _1\lambda ,\infty )\), where \(\gamma _1>0\) is a constant. It holds that \(|\ddot{\rho }(x_1;\lambda )-\ddot{\rho }(x_2;\lambda )|\le O(1)|x_1-x_2|\), for \(x_1,x_2\in (\gamma _1\lambda ,\infty )\).
(B3)
The pre-estimated nuisance parameter \(\widehat{\eta }\) is independent of the samples used to build the loss function. With a slight abuse of notation, we continue to use \(n\) to denote the sample size for constructing the loss function and assume that the pre-estimated nuisance parameter \(\widehat{\eta }\) is obtained from another sample of size \(r_\eta n\), where \(r_\eta n\) is bounded by a positive constant. Let \(R_{e,n}\) be the convergence rate of \(\Vert \widehat{e}-e^*\Vert _\infty\) and \(R_{\mu ,n}\) be the convergence rate of \(\Vert \widehat{\mu }-\mu ^*\Vert _\infty\). Define the rate \(R_{n}=\max \left\{ R_{\mu ,n},R_{e,n}\Vert \varvec{\theta }^*\Vert _2,\sqrt{p/n}\right\}\). The true parameter satisfies \(\min _{j\in \mathcal {D}_1}|\alpha _j^*|/\lambda _1\rightarrow \infty\), \(\min _{j\in \mathcal {D}_2}|\beta _j^*|/\lambda _2\rightarrow \infty\), as \(n\rightarrow \infty\). Additional conditions are listed below.
(i) \(\mathop {\max }\limits _{j\in \mathcal {D}_1}\left\{ \dot{\rho }(|\alpha _j^*|;\lambda _1)\right\} =O(R_{n}/\sqrt{d_{1n}})\), \(\mathop {\max }\limits _{j\in \mathcal {D}_2}\left\{ \dot{\rho }(|\beta _j^*|;\lambda _2)\right\} =O(R_{n}/\sqrt{d_{2n}})\).
(ii) \(\mathop {\max }\limits _{j\in \mathcal {D}_1}\left\{ |\ddot{\rho }(|\alpha _j^*|;\lambda _1)|\right\} =o(1)\), \(\mathop {\max }\limits _{j\in \mathcal {D}_2}\left\{ |\ddot{\rho }(|\beta _j^*|;\lambda _2)|\right\} =o(1)\).
(iii) \(\mathop {\max }\limits _{j\in \mathcal {D}_1}\left\{ \dot{\rho }(|\alpha _j^*|;\lambda _1)\right\} =O(1/\sqrt{nd_{1n}})\), \(\mathop {\max }\limits _{j\in \mathcal {D}_2}\left\{ \dot{\rho }(|\beta _j^*|;\lambda _2) \right\} =O(1/\sqrt{nd_{2n}})\).
(iv) \(\sqrt{n}R_{n}^2=o(1)\), and \(\sqrt{n}\max \left\{ \lambda _1,\lambda _2\right\} R_{n}=o(1)\).
(v) \(\mathbb {E}|\phi _j(\widetilde{{\textbf {Z}}},\widetilde{T};\varvec{\theta },\eta )-\phi _j(\widetilde{{\textbf {Z}}},\widetilde{T};\varvec{\theta }^*,\eta ^*)|^2\le \left( \Vert \varvec{\theta }-\varvec{\theta }^*\Vert _2\vee \Vert \eta -\eta ^*\Vert _\infty \right) ^bc_3\), \(j\in \mathcal {D}\), where \(b\) and \(c_3\) are positive constants. In addition, \(\sqrt{d_{n}}R_{n}^{b/2}=o(1)\), \(\sqrt{d_{n}}n^{-1/2+1/q}=o(1)\), \(q>2\).
(B4)
In what follows, we use \(\Vert \cdot \Vert _{Q,q}\) to denote the \(L^q(Q)\) norm. The uniform entropy numbers for set \(\mathcal {F}\) with radius \(\xi >0\) under \(L^q(Q)\) norm are defined as \(\sup _Q \log N(\xi , \mathcal {F}, \Vert \cdot \Vert _{Q,q})\), where \(N(\xi , \mathcal {F}, \Vert \cdot \Vert _{Q,q})\) is the corresponding covering number. Let \(\varTheta =\left\{ \varvec{\theta }:\Vert \varvec{\theta }-\varvec{\theta }^*\Vert _2\le R_{n}c_4\right\}\), where \(c_4\) is a positive constant. Define class
$$\begin{aligned} \mathcal {F}_{1,\eta }=\left\{ \phi _j(\cdot ;\varvec{\theta },\eta ):j\in \mathcal {D},\varvec{\theta }\in \varTheta \right\} , \end{aligned}$$
with measurable envelope \(F_{1,\eta }\). It satisfies \(\Vert F_{1,\eta }\Vert _{F^0,q}\le c_5\), where \(c_5\) is a positive constant and \(F^0\in \mathcal {P}\). For all \(0<\xi \le 1\), the uniform entropy number of \(\mathcal {F}_{1,\eta }\) obeys
$$\begin{aligned} \sup _{Q\in \mathcal {P}}\log N(\xi \Vert F_{1,\eta }\Vert _{Q,2},\mathcal {F}_{1,\eta },\Vert \cdot \Vert _{Q,2})\le v\log \frac{c_6}{\xi }, \end{aligned}$$
where \(c_6\) is a positive constant.
Remark 3
\({\textbf {(B0)}}\) guarantees that Stute’s empirical probability measure converges to \(F^0\) (Stute 1993). In addition, it ensures the asymptotic normality of \(\sum _{i=1}^nw_{i}\phi _j(\widetilde{{\textbf {Z}}}_{(i)},T_{(i)})\) given the true nuisance functions (Stute 1996). \({\textbf {(B0)}}\)(i) assumes that the censoring variable is conditionally independent of \(({\textbf {X}},A,S)\) given the failure time \(\widetilde{T}\). By contrast, inverse probability of censoring weighting (IPCW) of the form \(\delta /\Pr (C>T|{\textbf {X}},A,S,T)\) often requires \(C\perp \widetilde{T}|S,A,{\textbf {X}}\) instead. While Stute’s weights are fully non-parametric, using the IPCW requires estimating the survival function of the censoring time, \(\Pr (C>t|{\textbf {X}},A,S)\), which may introduce additional model assumptions. We discuss further development based on IPCW in Sect. 6. \({\textbf {(B1)}}\) puts constraints on the eigenvalues of the design matrices. \({\textbf {(B2)}}\) states the basic properties of the penalty function. Many penalty functions, such as SCAD and MCP, satisfy these properties. \({\textbf {(B3)}}\) contains several assumptions on the nuisance functions, the penalty function and the convergence rate. It should be noted that the independence of the pre-estimated nuisance parameter can be achieved by data splitting. \({\textbf {(B3)}}\)(i)-(iii) hold naturally when the signal of the true parameter is strong enough. \({\textbf {(B3)}}\)(iv) requires the convergence rate of the nuisance function estimate to be faster than \(n^{-1/4}\), and \(p=o(n^{1/2})\). \({\textbf {(B4)}}\) is used to meet the condition in Lemma 6.2 in Chernozhukov et al. (2018).
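The remark that SCAD and MCP are compatible with \({\textbf {(B2)}}\) and that \({\textbf {(B3)}}\)(i)-(iii) hold under a strong signal can be checked on the penalty derivatives \(\dot{\rho }(x;\lambda )\). The sketch below uses the standard textbook forms of these derivatives; the function names and the default shape parameters (`gamma=3.0` for MCP, `a=3.7` for SCAD) are common choices assumed here and are not taken from the paper.

```python
def mcp_derivative(x, lam, gamma=3.0):
    """Derivative of the MCP penalty for x > 0: (lam - x / gamma)_+."""
    return max(lam - x / gamma, 0.0)


def scad_derivative(x, lam, a=3.7):
    """Derivative of the SCAD penalty for x > 0 (requires a > 2)."""
    if x <= lam:
        return lam
    return max(a * lam - x, 0.0) / (a - 1.0)


def lasso_derivative(x, lam):
    """Derivative of the Lasso penalty for x > 0; constant in x."""
    return lam


lam = 0.1
for x in (1e-9, 0.2, 0.5):
    print(x, mcp_derivative(x, lam), scad_derivative(x, lam), lasso_derivative(x, lam))
```

Near zero, both concave penalties have derivative \(\lambda\), so \(\dot{\rho }(x;\lambda )/\lambda\) stays positive as \(x\rightarrow 0+\). Once \(x\) exceeds \(\gamma \lambda\) for MCP or \(a\lambda\) for SCAD, the derivative is exactly zero, which is why bounds on \(\dot{\rho }(|\alpha _j^*|;\lambda _1)\) and \(\dot{\rho }(|\beta _j^*|;\lambda _2)\) are automatically satisfied when \(\min _j|\alpha _j^*|/\lambda _1\) and \(\min _j|\beta _j^*|/\lambda _2\) diverge. The Lasso derivative, by contrast, never vanishes.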
Theorem 1
(Consistency) If \({\textbf {(M0)}}\), \({\textbf {(A0)}}\), \({\textbf {(B0)}}\), \({\textbf {(B1)}}\), \({\textbf {(B2)}}\) and \({\textbf {(B3)}}\) (i)(ii) hold, then \(\Vert \widehat{\varvec{\theta }}-\varvec{\theta }^*\Vert _2=O_p(R_{n})\), where \(R_{n}\) is defined in (B3).
Theorem 2
(Sparsity recovery) Suppose the result in Theorem 1 holds. If \(\min \left\{ \lambda _1,\lambda _2\right\} R_{n}^{-1}\rightarrow \infty\) as \(n\rightarrow \infty\), then \(\Pr \left( \widehat{\varvec{\theta }}_{\mathcal {D}^c}={\textbf {0}}\right) \rightarrow 1\).
To derive the asymptotic normality of the proposed estimator, we introduce some notations first. Let \(\widetilde{H}^{1}({\textbf {z}},y)=\Pr (\widetilde{{\textbf {Z}}}\le {\textbf {z}}, T\le y, \delta =1)\),
$$\begin{aligned} \gamma _{1j}(y)= & \frac{1}{1-H(y)}\int 1_{\left\{ y<v\right\} }\phi _j({\textbf {z}},v;\varvec{\theta }^*,\eta ^*)\gamma _0(v)\widetilde{H}^{1}(d{\textbf {z}},dv),\\ \gamma _{2j}(y)= & \int \int \frac{1_{\left\{ v<y,v<w\right\} }\phi _j({\textbf {z}},w;\varvec{\theta }^*,\eta ^*)\gamma _0(v)}{\left\{ 1-H(v)\right\} ^2}\widetilde{H}^0(dv)\widetilde{H}^{1}(d{\textbf {z}},dw), \end{aligned}$$
and \(\varvec{\gamma }_{1}(y)=\left\{ \gamma _{1j}(y),j=1,2,...,2p\right\}\), and \(\varvec{\gamma }_{2}(y)=\left\{ \gamma _{2j}(y),j=1,2,...,2p\right\}\). Define \({\textbf {V}}={\textbf {B}}^{-1}\varvec{\varSigma }_{\mathcal {D}}{\textbf {B}}^{-1}\), where
$$\begin{aligned} {\textbf {B}}= & \mathbb {E}\left[ \left\{ A-e^*({\textbf {Z}})\right\} ^2{\textbf {U}}_{\mathcal {D}}{\textbf {U}}_{\mathcal {D}}^{\textrm{T}}\right] ,\\ \varvec{\varSigma }= & \text{ Var }\left\{ \varvec{\phi }(\widetilde{{\textbf {Z}}},\widetilde{T};\varvec{\theta }^*,\eta ^*)\gamma _0(\widetilde{T})\delta +\varvec{\gamma }_1(\widetilde{T})(1-\delta )-\varvec{\gamma }_2(\widetilde{T})\right\} . \end{aligned}$$
Theorem 3
(Asymptotic normality) Suppose the consistency and sparsity recovery results hold. Assume that \({\textbf {(M0)}}\), \({\textbf {(A0)}}\), \({\textbf {(B0)}}\)-\({\textbf {(B2)}}\) and \({\textbf {(B3)}}\) (iii)(iv)(v) hold. For any \({\textbf {q}}\in \mathbb {R}^{d_{n}}\), \(\Vert {\textbf {q}}\Vert _2<\infty\), if \(\sigma ^2{\textbf {q}}^{\textrm{T}}{\textbf {V}}{\textbf {q}}\rightarrow \sigma _*^2\) as \(n\rightarrow \infty\), then
$$\begin{aligned} \sqrt{n}{\textbf {q}}^{\textrm{T}}\left( \widehat{\varvec{\theta }}_{\mathcal {D}}-\varvec{\theta }^*_{\mathcal {D}}\right) \rightarrow _d N(0,\sigma _*^2). \end{aligned}$$
We need additional conditions to derive the asymptotic properties of the RCT-only estimator. Since the conditions are similar to \({\textbf {(B0)}}\)-\({\textbf {(B4)}}\), details are presented in the Appendix. Define rate \(R_{n_1}=\max \left\{ R_{\mu ,n_1},R_{e,n_1}\Vert \varvec{\alpha }^*\Vert _2,\sqrt{p/n_1}\right\}\), and \({\textbf {V}}_r={\textbf {B}}_r^{-1}\varvec{\varSigma }_{r\mathcal {D}_1}{\textbf {B}}_r^{-1}\), where
$$\begin{aligned} {\textbf {B}}_r= & \mathbb {E}\left[ \left\{ A-e^*({\textbf {Z}})\right\} ^2{\textbf {X}}_{\mathcal {D}_1}{\textbf {X}}_{\mathcal {D}_1}^{\textrm{T}}|S=1\right] ,\\ \varvec{\varSigma }_r= & \text{ Var }\left\{ \varvec{\varphi }(\widetilde{{\textbf {X}}},\widetilde{T};\varvec{\alpha }^*,\eta ^*)\gamma _{r0}(\widetilde{T})\delta +\varvec{\gamma }_{r1}(\widetilde{T})(1-\delta )-\varvec{\gamma }_{r2}(\widetilde{T})|S=1\right\} . \end{aligned}$$
The definitions of \(\varvec{\varphi }\), \(\gamma _{r0}\), \(\varvec{\gamma }_{r1}\) and \(\varvec{\gamma }_{r2}\) are given in the Appendix.
Theorem 4
(Asymptotic normality for RCT-only estimator) Suppose the consistency result with rate \(R_{n_1}\) and the sparsity recovery result hold. Assume that \({\textbf {(M0)}}\), \({\textbf {(A0)}}\), \({\textbf {(C0)}}\), \({\textbf {(C1)}}\) and \({\textbf {(C2)}}\) (in the Appendix) hold. For any \({\textbf {q}}\in \mathbb {R}^{d_{1n}}\), \(\Vert {\textbf {q}}\Vert _2<\infty\), if \(\sigma ^2{\textbf {q}}^{\textrm{T}}{\textbf {V}}_r{\textbf {q}}\rightarrow \sigma _{r*}^2\) as \(n_1\rightarrow \infty\), then
$$\begin{aligned} \sqrt{n_1}{\textbf {q}}^{\textrm{T}}\left( \widehat{\varvec{\alpha }}^{rct}_{\mathcal {D}_1}-\varvec{\alpha }^*_{\mathcal {D}_1}\right) \rightarrow _d N(0,\sigma _{r*}^2). \end{aligned}$$
Theorem 5
(Efficiency gain) Suppose the results in Theorems 3 and 4 hold. If there is no censoring, it can be seen that \(\varvec{\varSigma }_{r\mathcal {D}_1}={\textbf {B}}_r\), and \(\varvec{\varSigma }_{\mathcal {D}}={\textbf {B}}\). For any \({\textbf {q}}\in \mathbb {R}^{d_{1n}}\), \(\Vert {\textbf {q}}\Vert _2<\infty\), with probability converging to 1, we have
$$\begin{aligned} \text{ Var }(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}_{\mathcal {D}_1})\le \text{ Var }(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}^{rct}_{\mathcal {D}_1}), \end{aligned}\tag{7}$$
where the equality holds if and only if there exists a \(d_{2n}\times d_{1n}\) constant matrix \({\textbf {Q}}\) such that, when \(S=0\), \({\textbf {X}}_{\mathcal {D}_1}={\textbf {Q}}^{\textrm{T}}{\textbf {X}}_{\mathcal {D}_2}\). In particular, when \(\mathcal {D}_1\subset \mathcal {D}_2\), the equality holds. When \({\mathcal {D}_2}=\emptyset\), under (B1)(i), the inequality in (7) holds strictly.
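One way to see where the inequality in (7) comes from in the uncensored case is the following sketch. It is not the authors' proof, which is given in the Appendix, and it assumes \(n_1/n\rightarrow r_{_S}=\Pr (S=1)\). The symbols \(W\) and \({\textbf {C}}_{11}\) are introduced here only for illustration; \({\textbf {B}}_{11}\), \({\textbf {B}}_{12}=\mathbb {E}\left[ W{\textbf {X}}_{\mathcal {D}_1}{\textbf {X}}_{\mathcal {D}_2}^{\textrm{T}}\right]\) and \({\textbf {B}}_{22}=\mathbb {E}\left[ W{\textbf {X}}_{\mathcal {D}_2}{\textbf {X}}_{\mathcal {D}_2}^{\textrm{T}}\right]\) are the blocks of \({\textbf {B}}\) corresponding to \(\varvec{\alpha }^*_{\mathcal {D}_1}\) and \(\varvec{\beta }^*_{\mathcal {D}_2}\). Splitting \({\textbf {B}}_{11}\) by data source gives

$$\begin{aligned} {\textbf {B}}_{11}&=r_{_S}{\textbf {B}}_r+{\textbf {C}}_{11},\qquad {\textbf {C}}_{11}=\mathbb {E}\left[ W{\textbf {X}}_{\mathcal {D}_1}{\textbf {X}}_{\mathcal {D}_1}^{\textrm{T}}\right] ,\qquad W=(1-S)\left\{ A-e^*({\textbf {Z}})\right\} ^2,\\ {\textbf {B}}_{11}-{\textbf {B}}_{12}{\textbf {B}}_{22}^{-1}{\textbf {B}}_{12}^{\textrm{T}}&=r_{_S}{\textbf {B}}_r+\left( {\textbf {C}}_{11}-{\textbf {B}}_{12}{\textbf {B}}_{22}^{-1}{\textbf {B}}_{12}^{\textrm{T}}\right) \ge r_{_S}{\textbf {B}}_r, \end{aligned}$$

where the matrix inequality is in the positive semi-definite order. The term in parentheses is the Schur complement of the second-moment matrix of \(\sqrt{W}({\textbf {X}}_{\mathcal {D}_1}^{\textrm{T}},{\textbf {X}}_{\mathcal {D}_2}^{\textrm{T}})^{\textrm{T}}\) and is therefore positive semi-definite. Inverting both sides shows that the block of \({\textbf {B}}^{-1}\) corresponding to \(\varvec{\alpha }^*_{\mathcal {D}_1}\) is no larger than \((r_{_S}{\textbf {B}}_r)^{-1}\), which is the covariance of the RCT-only estimator on the \(\sqrt{n}\) scale, up to the common factor \(\sigma ^2\). The Schur complement vanishes exactly when \({\textbf {X}}_{\mathcal {D}_1}\) is a linear function of \({\textbf {X}}_{\mathcal {D}_2}\) on the RWD sample, which matches the equality condition stated in the theorem.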
Remark 4
Censoring leads to a more complicated form of variance, thus it is difficult to see the efficiency gain directly. Let \({\textbf {B}}=({\textbf {B}}_{11},{\textbf {B}}_{12};{\textbf {B}}_{21},{\textbf {B}}_{22})\) where \({\textbf {B}}_{11}=\mathbb {E}\left\{ A-e^*({\textbf {Z}})\right\} ^2{\textbf {X}}_{\mathcal {D}_1}{\textbf {X}}_{\mathcal {D}_1}^{\textrm{T}}\), \({\textbf {B}}_{12}=\mathbb {E}(1-S)\left\{ A-e^*({\textbf {Z}})\right\} ^2{\textbf {X}}_{\mathcal {D}_1}{\textbf {X}}_{\mathcal {D}_2}^{\textrm{T}}\), \({\textbf {B}}_{22}=\mathbb {E}(1-S)\left\{ A-e^*({\textbf {Z}})\right\} ^2{\textbf {X}}_{\mathcal {D}_2}{\textbf {X}}_{\mathcal {D}_2}^{\textrm{T}}\). Define \(\varvec{\varOmega }_{11}=\left( {\textbf {B}}_{11}-{\textbf {B}}_{12}{\textbf {B}}_{22}^{-1}{\textbf {B}}_{12}^{\textrm{T}}\right) ^{-1}\). Let \(\varvec{\varSigma }_{11}\) be the submatrix of \(\varvec{\varSigma }\) with columns and rows corresponding to \(\varvec{\alpha }^*_{\mathcal {D}_1}\), \(\varvec{\varSigma }_{12}\) be the submatrix with columns corresponding to \(\varvec{\alpha }^*_{\mathcal {D}_1}\) and rows corresponding to \(\varvec{\beta }^*_{\mathcal {D}_2}\), \(\varvec{\varSigma }_{22}\) be the submatrix with columns and rows corresponding to \(\varvec{\beta }^*_{\mathcal {D}_2}\). Let \(r_{_S}=\Pr (S=1)\), and \(\varDelta \varvec{\varSigma }_{11}= \varvec{\varSigma }_{11}-{\textbf {B}}_{12}{\textbf {B}}_{22}^{-1}\varvec{\varSigma }_{12}^{\textrm{T}}-\varvec{\varSigma }_{12}{\textbf {B}}_{22}^{-1}{\textbf {B}}_{12}^{\textrm{T}}+{\textbf {B}}_{12}{\textbf {B}}_{22}^{-1}\varvec{\varSigma }_{22}{\textbf {B}}_{22}^{-1}{\textbf {B}}_{12}^{\textrm{T}}\). Generally, if
$$\begin{aligned} {\textbf {B}}_r^{-1}\varvec{\varSigma }_r{\textbf {B}}_r^{-1}-r_{_S}\varvec{\varOmega }_{11}\varDelta \varvec{\varSigma }_{11}\varvec{\varOmega }_{11}\ge 0, \end{aligned}$$
that is, the matrix is positive semi-definite, then the variance of the proposed estimate \(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}_{\mathcal {D}_1}\) will be no larger than that of the RCT-only estimate \(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}_{\mathcal {D}_1}^{rct}\). In particular, when \(\mathcal {D}_2=\emptyset\), \({\textbf {B}}_{11}={\textbf {B}}_r\) and \(\varvec{\varSigma }_{11}=\varvec{\varSigma }_r\), i.e., the distributions of \((A,{\textbf {X}})\) and of censoring in the RCT and RWD are similar, the variance of the proposed estimate \(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}_{\mathcal {D}_1}\) will be strictly smaller than that of the RCT-only estimate \(\sqrt{n}{\textbf {q}}^{\textrm{T}}\widehat{\varvec{\alpha }}_{\mathcal {D}_1}^{rct}\).
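The condition in Remark 4 can be checked numerically once estimates of the blocks are available. The following is a minimal sketch, not the authors' code: the function names `efficiency_gap` and `is_psd` are hypothetical, and `S12` is taken to have shape \(d_{1n}\times d_{2n}\) so that the products in \(\varDelta \varvec{\varSigma }_{11}\) are conformable. The toy inputs mimic the no-censoring case, where \(\varvec{\varSigma }={\textbf {B}}\) and \(\varvec{\varSigma }_r={\textbf {B}}_r\).

```python
import numpy as np

def efficiency_gap(B_r, Sigma_r, B11, B12, B22, S11, S12, S22, r_S):
    """Return B_r^{-1} Sigma_r B_r^{-1} - r_S * Omega11 DeltaSigma11 Omega11.

    Assumption: S12 has the same shape as B12 (d1 x d2).
    """
    B_r_inv = np.linalg.inv(B_r)
    K = B12 @ np.linalg.inv(B22)              # B12 B22^{-1}, shape d1 x d2
    Omega11 = np.linalg.inv(B11 - K @ B12.T)  # inverse Schur complement
    Delta = S11 - K @ S12.T - S12 @ K.T + K @ S22 @ K.T
    return B_r_inv @ Sigma_r @ B_r_inv - r_S * Omega11 @ Delta @ Omega11

def is_psd(M, tol=1e-10):
    """Check positive semi-definiteness via the eigenvalues of the symmetric part."""
    sym = 0.5 * (M + M.T)
    return bool(np.linalg.eigvalsh(sym).min() >= -tol)

# Toy numbers (purely illustrative): d1 = 2, d2 = 1, no censoring.
B11 = 2.0 * np.eye(2)
B12 = np.array([[0.5], [0.0]])
B22 = np.array([[1.0]])
B_r = B11.copy()
gap = efficiency_gap(B_r, B_r, B11, B12, B22, B11, B12, B22, r_S=0.2)
print(is_psd(gap))
```

When the returned matrix is positive semi-definite, the sufficient condition of the remark is met for every direction \({\textbf {q}}\).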
Remark 5
(Variance estimation) The theoretical variances obtained in Theorems 3 and 4 are not easy to estimate based on the formulations. Following Huang et al. (2006), we estimate the variance using the nonparametric 0.632 bootstrap (Efron and Tibshirani 1993), in which approximately \(0.632n\) samples from the \(n\) observations are randomly selected without replacement.
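The resampling scheme described in Remark 5 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `subsample_bootstrap_se` and its arguments are hypothetical names, `estimator` stands for any function that maps a data array to a coefficient vector, and the sketch returns the raw standard deviation of the subsample estimates because the text does not specify any rescaling.

```python
import numpy as np

def subsample_bootstrap_se(data, estimator, n_boot=500, frac=0.632, seed=0):
    """Standard deviation of estimates over subsamples drawn without replacement.

    data      : array with one row per observation
    estimator : callable, data subset -> 1-D array of estimates (hypothetical)
    frac      : fraction of observations kept in each subsample (0.632 here)
    """
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    m = int(round(frac * n))
    draws = []
    for _ in range(n_boot):
        idx = rng.choice(n, size=m, replace=False)  # without replacement
        draws.append(np.atleast_1d(estimator(data[idx])))
    return np.asarray(draws).std(axis=0, ddof=1)

# Usage with a placeholder estimator (column means).
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 3))
se = subsample_bootstrap_se(x, lambda d: d.mean(axis=0), n_boot=50)
print(se.shape)
```

In practice the placeholder estimator would be replaced by the full penalized fitting procedure, which makes each bootstrap draw as expensive as one model fit.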

4 Simulation

We conduct simulation studies to evaluate the performance of the proposed method, including efficiency gain (compared with estimators that use only RCT data), parameter estimation, variable selection, and identification of unmeasured confounding. The data are generated from the following model
$$\begin{aligned} \log (\widetilde{T})={\mu _0({\textbf {X}},S)}+A{\textbf {X}}^{\textrm{T}}\varvec{\alpha }^*+{(1-S)}u+\epsilon , \end{aligned}$$
where \(\mu _0({\textbf {X}},S)=\sin (X_1)+0.2X_4^2-0.5{\textbf {X}}^{\textrm{T}}\varvec{\alpha }^*-0.5(1-S){\textbf {X}}^{\textrm{T}}\varvec{\beta }^*\). Here \({\textbf {X}}\) is observable, while \(u\) is the unmeasured confounding effect. We generate \(n=2500\) samples from this model with the following distributions: \(S\sim Bernoulli(0.2)\), \(A\sim Bernoulli(0.5)\), \({\textbf {X}}|A\sim N(0.2A\times ({\textbf {1}}_{8},{\textbf {0}}_{p-8}),\varvec{\varSigma })\), \(u|A\sim N(A{\textbf {X}}^{\textrm{T}}\varvec{\beta }^*,\varvec{\varSigma })\), and \(\epsilon \sim N(0,1)\), where \(\varvec{\varSigma }=(0.3^{|i-j|},i,j=1,2,...,p)\). Let \(\varvec{\alpha }^*=Signal\times ({\textbf {1}}_{4},-{\textbf {1}}_{4},{\textbf {0}}_{p-8})\) with \(Signal=2\), and let \(\varvec{\beta }^*=Signal\times ({\textbf {1}}_{2},-{\textbf {1}}_{2},{\textbf {0}}_{p-4})\) when an unmeasured confounding effect exists and \(\varvec{\beta }^*={\textbf {0}}_{p}\) otherwise. The dimension of \({\textbf {X}}\) is \(p\in \left\{ 20,50\right\}\). The censoring time satisfies \(\log C\sim \text {Unif}[t_0,t_1]\), where \(t_0\) and \(t_1\) are chosen so that the censoring rate is around 20% or 40%. We adopt the MCP function as the penalty function, i.e., \(\rho (t;\lambda )=\lambda \int _{0}^{t}\big (1-x/(\gamma \lambda )\big )_+dx\).
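The data-generating design and the MCP penalty above can be sketched in Python as follows. This is a rough sketch under several assumptions that are not in the text: the function names are hypothetical, the scalar \(u\) is drawn with unit variance (the text writes its variance as \(\varvec{\varSigma }\)), the default `t0`, `t1` and `gamma` are placeholders that would need tuning, and right censoring is applied as the minimum of the log event time and the log censoring time.

```python
import numpy as np

def ar1_cov(p, rho=0.3):
    """Covariance matrix with entries rho^{|i-j|}."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def simulate(n=2500, p=20, signal=2.0, confounded=True, t0=-5.0, t1=15.0, seed=0):
    """Draw one data set from the simulation model (t0, t1 are placeholders)."""
    rng = np.random.default_rng(seed)
    alpha = signal * np.r_[np.ones(4), -np.ones(4), np.zeros(p - 8)]
    if confounded:
        beta = signal * np.r_[np.ones(2), -np.ones(2), np.zeros(p - 4)]
    else:
        beta = np.zeros(p)
    S = rng.binomial(1, 0.2, n)                      # 1 = RCT, 0 = RWD
    A = rng.binomial(1, 0.5, n)                      # treatment indicator
    shift = 0.2 * np.r_[np.ones(8), np.zeros(p - 8)]
    X = rng.multivariate_normal(np.zeros(p), ar1_cov(p), size=n) + A[:, None] * shift
    u = A * (X @ beta) + rng.normal(size=n)          # assumption: unit variance
    mu0 = (np.sin(X[:, 0]) + 0.2 * X[:, 3] ** 2
           - 0.5 * (X @ alpha) - 0.5 * (1 - S) * (X @ beta))
    log_T = mu0 + A * (X @ alpha) + (1 - S) * u + rng.normal(size=n)
    log_C = rng.uniform(t0, t1, n)
    log_Y = np.minimum(log_T, log_C)                 # observed log time
    delta = (log_T <= log_C).astype(int)             # 1 = event observed
    return {"X": X, "A": A, "S": S, "log_T": log_T, "log_Y": log_Y,
            "delta": delta, "alpha": alpha, "beta": beta}

def mcp_penalty(t, lam, gamma=3.0):
    """Closed form of lam * int_0^|t| (1 - x/(gamma*lam))_+ dx."""
    t = np.abs(np.asarray(t, dtype=float))
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)

d = simulate(n=400, p=20, seed=1)
print("empirical censoring rate:", 1 - d["delta"].mean())
```

Adjusting `t0` and `t1` until the printed rate is near 0.2 or 0.4 reproduces the role these constants play in the design.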

4.1 Finite-sample studies

For the proposed method, we use cross-validation (CV) and BIC to select the tuning parameters and refer to the resulting estimators as RL.cv and RL.bic, respectively. We also implement the analysis that ignores the unmeasured confounding effect and refer to it as RL.NAI. In addition, we compare the proposed method with the following methods:
  • Outcome-adjusted method: define the adjusted outcome
    $$\begin{aligned} \widetilde{T}^{adjust}=\frac{A\left\{ \log (\widetilde{T})-\mu _1({\textbf {Z}})\right\} }{e({\textbf {Z}})}+\mu _1({\textbf {Z}})-\frac{(1-A)\left\{ \log (\widetilde{T})-\mu _0({\textbf {Z}})\right\} }{1-e({\textbf {Z}})}-\mu _0({\textbf {Z}}). \end{aligned}$$
    Under assumptions \({\textbf {(A0)}}\) and \({\textbf {(M0)}}\), \(\mathbb {E}\left( \widetilde{T}^{adjust}|{\textbf {Z}}\right) ={\textbf {X}}^{\textrm{T}}\varvec{\alpha }+(1-S){\textbf {X}}^{\textrm{T}}\varvec{\beta }\). We can then build a penalized regression model based on this equation (similar to the construction of the proposed method). We use the same penalty function as the proposed method to identify the unmeasured confounding effect and adopt CV and BIC to select the tuning parameters. The resulting estimators are referred to as OA.cv and OA.bic, respectively.
  • AFT model with \(\mu _0\): under assumptions \({\textbf {(A0)}}\) and \({\textbf {(M0)}}\), it holds that
    $$\begin{aligned} \mathbb {E}\left\{ \log (\widetilde{T})-\mu _0({\textbf {Z}})|{\textbf {Z}},A=1\right\} ={\textbf {X}}^{\textrm{T}}\varvec{\alpha }+(1-S){\textbf {X}}^{\textrm{T}}\varvec{\beta }. \end{aligned}$$
    Then we can build the AFT model based on this equation. The estimation procedures are the same as the outcome-adjusted method. This method is referred to as GM0.cv and GM0.bic respectively.
  • AFT model with \(\mu _1\): under assumptions \({\textbf {(A0)}}\) and \({\textbf {(M0)}}\), it holds that
    $$\begin{aligned} \mathbb {E}\left\{ \mu _1({\textbf {Z}})-\log (\widetilde{T})|{\textbf {Z}},A=0\right\} ={\textbf {X}}^{\textrm{T}}\varvec{\alpha }+(1-S){\textbf {X}}^{\textrm{T}}\varvec{\beta }. \end{aligned}$$
    Then we can build the AFT model based on this equation. The estimation procedures are the same as the outcome-adjusted method. This method is referred to as GM1.cv and GM1.bic respectively.
  • The meta estimates: combine GM0.cv and GM1.cv (respectively, GM0.bic and GM1.bic), weighting by sample size. The resulting estimators are referred to as Meta.cv and Meta.bic, respectively.
  • AFT model with \(\mu _0\), \(\mu _1\): under assumptions \({\textbf {(A0)}}\) and \({\textbf {(M0)}}\), it holds that
    $$\begin{aligned} \mathbb {E}\left\{ \mu _1({\textbf {Z}})-\mu _0({\textbf {Z}})|{\textbf {Z}}\right\} ={\textbf {X}}^{\textrm{T}}\varvec{\alpha }+(1-S){\textbf {X}}^{\textrm{T}}\varvec{\beta }. \end{aligned}$$
    Then we can build the AFT model based on this equation. The following procedures are the same as the outcome-adjusted method. This method is referred to as GM01.cv and GM01.bic respectively.
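To make two of the comparison methods concrete, the sketch below computes the adjusted outcome used by the outcome-adjusted method and a sample-size-weighted combination of two coefficient vectors, as in the meta estimates. The function names are hypothetical, the nuisance quantities \(e\), \(\mu _0\), \(\mu _1\) are assumed to be already estimated and passed in as arrays, and the choice of which sample sizes enter the weights is an assumption. The handling of censoring is not covered here.

```python
import numpy as np

def adjusted_outcome(log_t, A, e, mu1, mu0):
    """Adjusted outcome: A(log T - mu1)/e + mu1 - (1-A)(log T - mu0)/(1-e) - mu0."""
    log_t, A, e, mu1, mu0 = (np.asarray(v, dtype=float) for v in (log_t, A, e, mu1, mu0))
    return (A * (log_t - mu1) / e + mu1
            - (1 - A) * (log_t - mu0) / (1 - e) - mu0)

def sample_size_weighted(est_a, n_a, est_b, n_b):
    """Combine two estimates with weights proportional to their sample sizes."""
    est_a, est_b = np.asarray(est_a, dtype=float), np.asarray(est_b, dtype=float)
    return (n_a * est_a + n_b * est_b) / (n_a + n_b)

# Toy values: one treated and one control subject with e = 0.5.
y_adj = adjusted_outcome(log_t=[2.0, 2.0], A=[1, 0], e=[0.5, 0.5],
                         mu1=[1.5, 1.5], mu0=[1.0, 1.0])
print(y_adj)
```

The adjusted outcome can then be regressed on \({\textbf {X}}\) and \((1-S){\textbf {X}}\) with the chosen penalty, which is the step the bullet describes in words.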
We also compute the RCT-only estimates for all these methods, using CV to select the tuning parameters, and refer to them as RL.RCT, OA.RCT, GM0.RCT, GM1.RCT, Meta.RCT, and GM01.RCT, respectively. In addition, assuming that the variables are correctly selected, we compute the oracle estimates, referred to as RL.or, RL.NAIor, OA.or, GM0.or, GM1.or, Meta.or, GM01.or, RL.RCTor, OA.RCTor, GM0.RCTor, GM1.RCTor, Meta.RCTor, and GM01.RCTor, respectively.
For the HTE parameters, we use the root mean squared error (RMSE) to evaluate estimation accuracy and the false discovery rate (FDR) to evaluate variable selection. The definitions are as follows: for simulation replicates \(b=1,2,...,B\), \(\text {RMSE}=(\text {MSE})^{1/2}\), where \(\text {MSE}=(Bp)^{-1}\sum _{b=1}^{B}\sum _{j=1}^{p}(\widehat{\alpha }_j^{(b)}-\alpha ^*_j)^2,\)
$$\begin{aligned} \text {FDR}=\frac{1}{B}\sum _{b=1}^{B}\frac{\Big |\left\{ j|\alpha ^*_j= 0,\widehat{\alpha }_j^{(b)}\ne 0\right\} \Big |}{\Big |\left\{ j|\widehat{\alpha }_j^{(b)}\ne 0\right\} \Big |}. \end{aligned}$$
We also record whether we correctly identify the existence of an unmeasured confounding effect, denoted by TIR. The definition is
$$\begin{aligned} {\text {TIR}=\frac{1}{B}\sum _{b=1}^{B}\left\{ 1\left( \widehat{\varvec{\beta }}={\textbf {0}},\varvec{\beta }^*={\textbf {0}}\right) +1\left( \widehat{\varvec{\beta }}\ne {\textbf {0}},\varvec{\beta }^*\ne {\textbf {0}}\right) \right\} .} \end{aligned}$$
The empirical results are based on \(B=500\) replications.
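The three criteria defined above translate directly into array operations. The sketch below is illustrative only: the function names are hypothetical, estimates are assumed to be stored with one row per replicate, and a replicate that selects no variable is assigned a false discovery proportion of zero, a convention the text does not state.

```python
import numpy as np

def rmse(alpha_hat, alpha_true):
    """Square root of the MSE averaged over B replicates and p coefficients."""
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    return float(np.sqrt(np.mean((alpha_hat - alpha_true) ** 2)))

def fdr(alpha_hat, alpha_true):
    """Average proportion of selected coefficients whose true value is zero."""
    selected = np.asarray(alpha_hat) != 0
    false_sel = selected & (np.asarray(alpha_true) == 0)
    n_sel = np.maximum(selected.sum(axis=1), 1)   # assumption: 0/0 counted as 0
    return float(np.mean(false_sel.sum(axis=1) / n_sel))

def tir(beta_hat, beta_true):
    """Share of replicates that correctly decide whether beta is the zero vector."""
    est_zero = np.all(np.asarray(beta_hat) == 0, axis=1)
    true_zero = bool(np.all(np.asarray(beta_true) == 0))
    return float(np.mean(est_zero == true_zero))

# Toy example with B = 2 replicates and p = 3 coefficients.
alpha_true = np.array([1.0, 0.0, 0.0])
alpha_hat = np.array([[1.0, 0.5, 0.0],
                      [0.8, 0.0, 0.0]])
beta_hat = np.array([[0.0, 0.0], [0.0, 0.1]])
print(rmse(alpha_hat, alpha_true), fdr(alpha_hat, alpha_true), tir(beta_hat, np.zeros(2)))
```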
The simulation results are shown in Tables 1 and 2. We make the following observations. (i) Among the oracle estimators, those that utilize RWD perform better than the RCT-only estimators. The proposed oracle estimator (RL.or) has the smallest RMSE in all settings compared with the estimates from other methods. (ii) The estimator that ignores the unmeasured confounding effect in RWD (RL.NAI) has the highest RMSE. This shows that ignoring unmeasured confounding can lead to substantial estimation error, confirming the necessity of identifying the unmeasured confounding effect in RWD. (iii) Regarding the selection of tuning parameters, CV is competitive with BIC when there is no unmeasured confounding effect and better than BIC when there is one. In the following, we analyze only the results from CV. All estimators that utilize RWD have smaller RMSE than the RCT-only estimators. Among the reported estimators, the RMSE of the proposed estimate (RL.cv) is noticeably lower than those of the other methods. The RMSE of OA.cv is the second lowest, and the RMSE of Meta.cv is slightly higher than that of OA.cv. (iv) The proposed estimator (RL.cv) has an FDR that is slightly lower than or competitive with that of OA.cv, and noticeably lower than those of the other methods. (v) Based on the TIR results, all methods identify unmeasured confounding well when it is present. When no unmeasured confounding effect exists, the proposed and outcome-adjusted methods perform better than the other methods. (vi) In general, the RMSEs and FDRs do not improve as the censoring rate increases. The estimators that utilize RWD gain more efficiency when there is no unmeasured confounding effect than when one exists.
Table 1
The RMSE (\(\times 10^{2}\)) of the HTE estimation when \(Signal=2\) over 500 experiment replicates (uc: with unmeasured confounding; nuc: no unmeasured confounding; CR: censoring rate)

| Methods | uc, p=20, CR=20% | uc, p=20, CR=40% | uc, p=50, CR=20% | uc, p=50, CR=40% | nuc, p=20, CR=20% | nuc, p=20, CR=40% | nuc, p=50, CR=20% | nuc, p=50, CR=40% |
|---|---|---|---|---|---|---|---|---|
| RL.or | 8.24 | 10.08 | 5.29 | 6.64 | 5.42 | 6.71 | 3.55 | 4.22 |
| RL.RCTor | 15.16 | 18.15 | 9.22 | 11.07 | 15.20 | 17.94 | 8.96 | 11.09 |
| RL.NAIor | 71.00 | 71.10 | 44.83 | 44.98 | | | | |
| RL.cv | 8.40 | 10.35 | 5.31 | 6.71 | 5.64 | 7.15 | 3.59 | 4.35 |
| RL.bic | 8.26 | 10.13 | 5.32 | 7.63 | 5.48 | 6.81 | 3.56 | 4.19 |
| RL.RCT | 15.67 | 19.05 | 9.29 | 11.78 | 15.71 | 18.95 | 9.22 | 11.82 |
| RL.NAI | 71.67 | 72.28 | 45.88 | 46.67 | | | | |
| OA.or | 11.30 | 12.97 | 7.10 | 8.54 | 8.87 | 10.72 | 5.71 | 7.04 |
| OA.RCTor | 22.23 | 25.97 | 15.82 | 18.49 | 22.32 | 25.95 | 15.75 | 18.57 |
| OA.cv | 11.34 | 13.10 | 7.13 | 8.60 | 8.90 | 10.81 | 5.73 | 7.05 |
| OA.bic | 11.33 | 13.03 | 7.13 | 9.12 | 8.89 | 10.75 | 5.72 | 7.04 |
| OA.RCT | 22.36 | 26.20 | 15.89 | 18.65 | 22.44 | 26.20 | 15.84 | 18.73 |
| GM0.or | 13.73 | 15.87 | 8.81 | 10.33 | 10.47 | 12.43 | 6.76 | 8.06 |
| GM0.RCTor | 24.92 | 29.09 | 17.40 | 20.61 | 24.92 | 29.09 | 17.40 | 20.42 |
| GM0.cv | 14.26 | 16.69 | 9.03 | 10.81 | 11.16 | 13.13 | 6.98 | 8.45 |
| GM0.bic | 13.78 | 16.20 | 8.97 | 11.41 | 10.74 | 12.54 | 6.85 | 8.10 |
| GM0.RCT | 24.81 | 29.10 | 17.35 | 21.02 | 24.81 | 29.10 | 17.32 | 20.85 |
| GM1.or | 11.76 | 13.79 | 7.54 | 9.23 | 9.52 | 11.35 | 6.13 | 7.40 |
| GM1.RCTor | 22.67 | 26.66 | 15.84 | 18.36 | 22.62 | 26.62 | 15.93 | 18.36 |
| GM1.cv | 12.35 | 14.68 | 7.81 | 9.86 | 9.99 | 11.98 | 6.34 | 7.77 |
| GM1.bic | 11.93 | 14.02 | 7.62 | 10.16 | 9.70 | 11.50 | 6.22 | 7.46 |
| GM1.RCT | 22.57 | 26.64 | 15.77 | 18.51 | 22.51 | 26.61 | 15.88 | 18.48 |
| Meta.or | 11.14 | 12.86 | 7.16 | 8.49 | 9.32 | 11.12 | 6.02 | 7.24 |
| Meta.RCTor | 22.34 | 25.93 | 15.68 | 18.19 | 22.31 | 25.93 | 15.67 | 18.12 |
| Meta.cv | 11.48 | 13.39 | 7.31 | 8.86 | 9.70 | 11.45 | 6.18 | 7.45 |
| Meta.bic | 11.21 | 13.04 | 7.25 | 9.20 | 9.51 | 11.20 | 6.11 | 7.27 |
| Meta.RCT | 22.17 | 25.77 | 15.59 | 18.35 | 22.13 | 25.77 | 15.58 | 18.28 |
| GM01.or | 13.77 | 17.05 | 8.95 | 11.30 | 11.50 | 14.86 | 7.49 | 9.77 |
| GM01.RCTor | 28.50 | 36.42 | 19.88 | 25.77 | 28.41 | 36.36 | 19.93 | 25.56 |
| GM01.cv | 14.63 | 18.69 | 9.66 | 13.40 | 12.49 | 16.61 | 8.06 | 11.80 |
| GM01.bic | 13.94 | 17.48 | 9.03 | 11.60 | 11.83 | 15.50 | 7.54 | 10.09 |
| GM01.RCT | 28.65 | 36.67 | 19.95 | 26.03 | 28.56 | 36.60 | 19.99 | 25.77 |

Some results are marked in bold to ease comparison between methods. The best results (i.e., smallest RMSE), excluding the oracle estimates, are underlined.
In the table, CR denotes the censoring rate. Methods with names starting with “RL” are based on the proposed model. RL.cv and RL.bic are the proposed estimates under the CV and BIC criteria, respectively. RL.RCT is the estimate based on RCT data only. RL.NAI is the naive estimate, which completely ignores the unmeasured confounding effect. RL.or, RL.RCTor, and RL.NAIor are the oracle estimates of the integrative analysis, RCT-only analysis, and naive analysis, respectively. The other methods, with names starting with “OA”, “GM0”, “GM1”, “Meta” and “GM01”, are introduced in detail in Sect. 4.
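Table 2
The average TIR (%) and FDR (%) when \(Signal=2\) over 500 experiment replicates (uc: with unmeasured confounding; nuc: no unmeasured confounding; CR: censoring rate)

| Index | Methods | uc, p=20, CR=20% | uc, p=20, CR=40% | uc, p=50, CR=20% | uc, p=50, CR=40% | nuc, p=20, CR=20% | nuc, p=20, CR=40% | nuc, p=50, CR=20% | nuc, p=50, CR=40% |
|---|---|---|---|---|---|---|---|---|---|
| TIR | RL.cv | 100 | 100 | 100 | 100 | 97.8 | 97.3 | 98.5 | 98.1 |
| TIR | RL.bic | 100 | 100 | 100 | 100 | 97.8 | 98.6 | 99.2 | 98.8 |
| TIR | OA.cv | 100 | 100 | 100 | 100 | 98.8 | 97.8 | 99.2 | 99.0 |
| TIR | OA.bic | 100 | 100 | 100 | 100 | 99.4 | 99.4 | 99.2 | 99.1 |
| TIR | GM0.cv | 100 | 100 | 100 | 100 | 85.0 | 82.0 | 90.8 | 87.2 |
| TIR | GM0.bic | 100 | 100 | 100 | 100 | 95.4 | 95.4 | 97.8 | 97.6 |
| TIR | GM1.cv | 100 | 100 | 100 | 100 | 87.0 | 84.6 | 92.4 | 86.8 |
| TIR | GM1.bic | 100 | 100 | 100 | 100 | 96.6 | 94.2 | 98.4 | 95.2 |
| TIR | Meta.cv | 100 | 100 | 100 | 100 | 74.2 | 70.2 | 85.0 | 75.4 |
| TIR | Meta.bic | 100 | 100 | 100 | 100 | 92.0 | 89.6 | 96.2 | 92.8 |
| TIR | GM01.cv | 100 | 100 | 100 | 100 | 60.2 | 26.8 | 61.6 | 13.4 |
| TIR | GM01.bic | 100 | 100 | 100 | 100 | 88.4 | 67.8 | 92.6 | 64.2 |
| FDR | RL.cv | 0.28 | 0.67 | 0.07 | 0.71 | 0.32 | 0.93 | 0.13 | 0.37 |
| FDR | RL.bic | 0.07 | 0.20 | 0.04 | 0.51 | 0.07 | 0.20 | 0.02 | 0.09 |
| FDR | RL.RCT | 3.51 | 5.00 | 3.48 | 6.16 | 3.39 | 5.07 | 3.63 | 5.81 |
| FDR | RL.NAI | 1.31 | 2.61 | 0.87 | 2.28 | | | | |
| FDR | OA.cv | 0.04 | 0.11 | 0.04 | 0.18 | 0.09 | 0.37 | 0.04 | 0.07 |
| FDR | OA.bic | 0.04 | 0.04 | 0.02 | 0.11 | 0.02 | 0.09 | 0.00 | 0.00 |
| FDR | OA.RCT | 0.90 | 1.55 | 0.56 | 1.59 | 0.80 | 1.71 | 0.74 | 1.71 |
| FDR | GM0.cv | 1.37 | 2.71 | 1.63 | 3.85 | 0.90 | 1.91 | 1.15 | 3.00 |
| FDR | GM0.bic | 0.18 | 0.77 | 0.40 | 1.34 | 0.24 | 0.34 | 0.16 | 0.24 |
| FDR | GM0.RCT | 1.51 | 2.24 | 0.86 | 1.81 | 1.51 | 2.24 | 1.03 | 1.65 |
| FDR | GM1.cv | 1.14 | 2.26 | 0.92 | 3.35 | 0.68 | 1.49 | 0.84 | 2.51 |
| FDR | GM1.bic | 0.13 | 0.59 | 0.07 | 1.18 | 0.16 | 0.18 | 0.11 | 0.24 |
| FDR | GM1.RCT | 1.21 | 2.55 | 0.46 | 1.85 | 1.29 | 2.66 | 0.68 | 1.44 |
| FDR | Meta.cv | 2.45 | 4.71 | 2.50 | 6.80 | 1.54 | 3.24 | 1.98 | 5.23 |
| FDR | Meta.bic | 0.31 | 1.35 | 0.46 | 2.43 | 0.37 | 0.52 | 0.27 | 0.46 |
| FDR | Meta.RCT | 2.66 | 4.63 | 1.31 | 3.58 | 2.76 | 4.74 | 1.71 | 3.05 |
| FDR | GM01.cv | 1.69 | 7.05 | 3.99 | 19.3 | 1.38 | 5.50 | 2.51 | 16.6 |
| FDR | GM01.bic | 0.16 | 1.02 | 0.28 | 2.24 | 0.18 | 1.08 | 0.13 | 2.24 |
| FDR | GM01.RCT | 1.18 | 2.30 | 0.37 | 1.81 | 1.12 | 2.09 | 0.30 | 1.60 |
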
In the table, CR denotes the censoring rate. TIR is the rate of correctly identifying whether an unmeasured confounding effect exists. FDR is the false discovery rate of the HTE estimates. Methods with names starting with “RL” are based on the proposed model. RL.cv and RL.bic are the proposed estimates under the CV and BIC criteria, respectively. RL.RCT is the estimate based on RCT data only. RL.NAI is the naive estimate, which completely ignores the unmeasured confounding effect. The other methods, with names starting with “OA”, “GM0”, “GM1”, “Meta” and “GM01”, are introduced in detail in Sect. 4.
Additional simulation experiments considering a weaker signal strength of the coefficients (\(Signal=1\)), a more severe censoring rate (CR\(=60\%\)), and a log-logistic distribution of the survival time are presented in detail in the supplementary materials. The results show that the proposed method maintains its effectiveness across these settings. To summarize, the proposed method identifies unmeasured confounding effects well and gains efficiency over the RCT-only estimators. The proposed estimator performs well in settings with relatively high dimensions and severe censoring. In addition, it performs best among the reported methods.

4.2 Variance estimation

Simulations are implemented to evaluate the nonparametric bootstrap approach for variance estimation described in Remark 5. We compute the variance estimates for the proposed method using two types of data: RCT combined with RWD (denoted RCT+RWD) and RCT data alone. Here we take a bootstrap sample size of 500. Tables 3 and 4 report the bias of the point estimates (Bias), the standard deviations of the point estimates (SD), the means of the bootstrap-estimated standard deviations (SE), and the coverage proportions of the 95% confidence intervals (CP), based on 500 replications.
Upon examining Tables 3 and 4, it is evident that the bootstrap standard deviation estimates match the standard deviations of the estimates well. Furthermore, the variance of the estimates derived from the combined RCT+RWD dataset is observed to be lower than that obtained from RCT data alone. This variance reduction, or shrinkage, is particularly pronounced for the coefficients in \(\mathcal {D}_1\setminus \mathcal {D}_2\).
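The summary quantities reported in the inference tables can be computed from the replicate-level output as in the sketch below. It is a sketch under assumptions, not the authors' code: `inference_summary` is a hypothetical name, and the confidence intervals are assumed to be Wald intervals (estimate plus or minus 1.96 times the bootstrap standard error), which the text does not state explicitly.

```python
import numpy as np

def inference_summary(est, se, truth, z=1.959963984540054):
    """Bias, SD, mean SE and coverage across replicates.

    est, se : arrays of shape (replicates, coefficients)
    truth   : true coefficient values, shape (coefficients,)
    z       : normal quantile for a 95% Wald interval (assumption)
    """
    est = np.asarray(est, dtype=float)
    se = np.asarray(se, dtype=float)
    truth = np.asarray(truth, dtype=float)
    covered = np.abs(est - truth) <= z * se
    return {
        "Bias": est.mean(axis=0) - truth,
        "SD": est.std(axis=0, ddof=1),
        "SE": se.mean(axis=0),
        "CP": covered.mean(axis=0),
    }

# Toy example: three replicates of a single coefficient with true value 2.
summary = inference_summary(est=[[1.9], [2.1], [2.0]],
                            se=[[0.1], [0.1], [0.1]],
                            truth=[2.0])
print(summary)
```

Comparing the SD and SE entries column by column is the check that the bootstrap standard errors track the empirical variability of the estimates.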
Table 3
The inference results of the proposed HTE estimate over 500 experiment replicates when \(p=20\)

| Case | Dataset | Index | \(\alpha ^*_1=-2\) | \(\alpha ^*_2=-2\) | \(\alpha ^*_3=-2\) | \(\alpha ^*_4=-2\) | \(\alpha ^*_5=2\) | \(\alpha ^*_6=2\) | \(\alpha ^*_7=2\) | \(\alpha ^*_8=2\) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | RCT | Bias | 0.090 | 0.098 | 0.098 | 0.083 | -0.079 | -0.102 | -0.095 | -0.084 |
| 1 | RCT | SD | 0.167 | 0.185 | 0.193 | 0.193 | 0.192 | 0.190 | 0.182 | 0.177 |
| 1 | RCT | SE | 0.188 | 0.196 | 0.196 | 0.203 | 0.197 | 0.197 | 0.196 | 0.192 |
| 1 | RCT | CP(95%) | 0.939 | 0.921 | 0.926 | 0.936 | 0.934 | 0.933 | 0.933 | 0.939 |
| 1 | RCT+RWD | Bias | 0.008 | 0.006 | 0.006 | 0.035 | 0.010 | 0.005 | 0.005 | 0.013 |
| 1 | RCT+RWD | SD | 0.080 | 0.085 | 0.083 | 0.094 | 0.086 | 0.087 | 0.083 | 0.079 |
| 1 | RCT+RWD | SE | 0.081 | 0.084 | 0.083 | 0.090 | 0.088 | 0.084 | 0.084 | 0.081 |
| 1 | RCT+RWD | CP(95%) | 0.946 | 0.947 | 0.923 | 0.910 | 0.931 | 0.931 | 0.944 | 0.944 |
| 2 | RCT | Bias | 0.079 | 0.076 | 0.088 | 0.073 | -0.083 | -0.079 | -0.094 | -0.079 |
| 2 | RCT | SD | 0.208 | 0.223 | 0.234 | 0.233 | 0.240 | 0.214 | 0.215 | 0.221 |
| 2 | RCT | SE | 0.230 | 0.239 | 0.239 | 0.247 | 0.239 | 0.241 | 0.239 | 0.232 |
| 2 | RCT | CP(95%) | 0.959 | 0.959 | 0.928 | 0.949 | 0.925 | 0.959 | 0.955 | 0.949 |
| 2 | RCT+RWD | Bias | 0.009 | 0.006 | 0.009 | 0.035 | 0.010 | 0.007 | 0.004 | 0.012 |
| 2 | RCT+RWD | SD | 0.094 | 0.100 | 0.101 | 0.114 | 0.100 | 0.101 | 0.103 | 0.091 |
| 2 | RCT+RWD | SE | 0.097 | 0.099 | 0.101 | 0.110 | 0.111 | 0.100 | 0.100 | 0.099 |
| 2 | RCT+RWD | CP(95%) | 0.955 | 0.951 | 0.940 | 0.904 | 0.947 | 0.932 | 0.957 | 0.947 |
| 3 | RCT | Bias | 0.102 | 0.092 | 0.099 | 0.078 | -0.076 | -0.098 | -0.086 | -0.098 |
| 3 | RCT | SD | 0.172 | 0.184 | 0.194 | 0.197 | 0.187 | 0.190 | 0.185 | 0.182 |
| 3 | RCT | SE | 0.187 | 0.196 | 0.196 | 0.202 | 0.197 | 0.198 | 0.198 | 0.191 |
| 3 | RCT | CP(95%) | 0.933 | 0.931 | 0.927 | 0.931 | 0.944 | 0.929 | 0.944 | 0.929 |
| 3 | RCT+RWD | Bias | 0.040 | 0.021 | 0.030 | -0.015 | 0.004 | 0.006 | 0.007 | 0.011 |
| 3 | RCT+RWD | SD | 0.153 | 0.160 | 0.160 | 0.164 | 0.081 | 0.082 | 0.088 | 0.080 |
| 3 | RCT+RWD | SE | 0.169 | 0.182 | 0.166 | 0.163 | 0.090 | 0.093 | 0.094 | 0.089 |
| 3 | RCT+RWD | CP(95%) | 0.958 | 0.960 | 0.949 | 0.947 | 0.967 | 0.960 | 0.956 | 0.960 |
| 4 | RCT | Bias | 0.091 | 0.081 | 0.087 | 0.069 | -0.084 | -0.080 | -0.093 | -0.078 |
| 4 | RCT | SD | 0.205 | 0.221 | 0.231 | 0.228 | 0.228 | 0.215 | 0.208 | 0.224 |
| 4 | RCT | SE | 0.230 | 0.241 | 0.238 | 0.245 | 0.239 | 0.239 | 0.238 | 0.231 |
| 4 | RCT | CP(95%) | 0.959 | 0.953 | 0.931 | 0.957 | 0.936 | 0.949 | 0.949 | 0.938 |
| 4 | RCT+RWD | Bias | 0.041 | 0.032 | 0.030 | -0.027 | 0.011 | 0.012 | 0.011 | 0.011 |
| 4 | RCT+RWD | SD | 0.188 | 0.185 | 0.188 | 0.187 | 0.103 | 0.104 | 0.104 | 0.100 |
| 4 | RCT+RWD | SE | 0.216 | 0.230 | 0.204 | 0.202 | 0.122 | 0.123 | 0.122 | 0.118 |
| 4 | RCT+RWD | CP(95%) | 0.966 | 0.972 | 0.968 | 0.968 | 0.966 | 0.976 | 0.970 | 0.968 |

Cases 1-4 represent (nuc, 20%CR), (nuc, 40%CR), (uc, 20%CR), and (uc, 40%CR), where nuc means there is no unmeasured confounding and uc means there is unmeasured confounding. The SEs are marked in bold to ease comparison between RCT and RCT+RWD.
Table 4
The inference results of the proposed HTE estimate over 500 experiment replicates when \(p=50\)

| Case | Dataset | Index | \(\alpha ^*_1=-2\) | \(\alpha ^*_2=-2\) | \(\alpha ^*_3=-2\) | \(\alpha ^*_4=-2\) | \(\alpha ^*_5=2\) | \(\alpha ^*_6=2\) | \(\alpha ^*_7=2\) | \(\alpha ^*_8=2\) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | RCT | Bias | 0.076 | 0.073 | 0.071 | 0.067 | -0.082 | -0.056 | -0.084 | -0.083 |
| 1 | RCT | SD | 0.170 | 0.176 | 0.180 | 0.189 | 0.189 | 0.180 | 0.183 | 0.177 |
| 1 | RCT | SE | 0.180 | 0.189 | 0.189 | 0.196 | 0.188 | 0.189 | 0.188 | 0.182 |
| 1 | RCT | CP(95%) | 0.934 | 0.948 | 0.950 | 0.946 | 0.924 | 0.962 | 0.928 | 0.940 |
| 1 | RCT+RWD | Bias | 0.000 | 0.015 | 0.008 | 0.045 | 0.001 | 0.003 | 0.003 | 0.006 |
| 1 | RCT+RWD | SD | 0.084 | 0.085 | 0.089 | 0.086 | 0.085 | 0.089 | 0.089 | 0.083 |
| 1 | RCT+RWD | SE | 0.079 | 0.082 | 0.082 | 0.086 | 0.085 | 0.082 | 0.082 | 0.082 |
| 1 | RCT+RWD | CP(95%) | 0.942 | 0.930 | 0.930 | 0.903 | 0.938 | 0.928 | 0.930 | 0.946 |
| 2 | RCT | Bias | 0.052 | 0.076 | 0.079 | 0.068 | -0.068 | -0.058 | -0.059 | -0.072 |
| 2 | RCT | SD | 0.201 | 0.220 | 0.217 | 0.224 | 0.218 | 0.216 | 0.229 | 0.221 |
| 2 | RCT | SE | 0.222 | 0.232 | 0.232 | 0.240 | 0.232 | 0.232 | 0.232 | 0.224 |
| 2 | RCT | CP(95%) | 0.948 | 0.952 | 0.954 | 0.958 | 0.947 | 0.958 | 0.948 | 0.947 |
| 2 | RCT+RWD | Bias | 0.000 | 0.011 | 0.011 | 0.042 | 0.006 | 0.006 | 0.008 | 0.015 |
| 2 | RCT+RWD | SD | 0.093 | 0.097 | 0.101 | 0.103 | 0.098 | 0.099 | 0.104 | 0.096 |
| 2 | RCT+RWD | SE | 0.096 | 0.099 | 0.097 | 0.107 | 0.107 | 0.100 | 0.099 | 0.099 |
| 2 | RCT+RWD | CP(95%) | 0.950 | 0.948 | 0.937 | 0.906 | 0.941 | 0.941 | 0.924 | 0.933 |
| 3 | RCT | Bias | 0.064 | 0.092 | 0.069 | 0.085 | -0.072 | -0.070 | -0.084 | -0.077 |
| 3 | RCT | SD | 0.166 | 0.188 | 0.181 | 0.189 | 0.187 | 0.184 | 0.178 | 0.177 |
| 3 | RCT | SE | 0.180 | 0.188 | 0.187 | 0.195 | 0.188 | 0.189 | 0.187 | 0.182 |
| 3 | RCT | CP(95%) | 0.943 | 0.927 | 0.937 | 0.933 | 0.945 | 0.927 | 0.931 | 0.947 |
| 3 | RCT+RWD | Bias | 0.037 | 0.041 | 0.027 | -0.014 | 0.005 | 0.007 | 0.004 | 0.008 |
| 3 | RCT+RWD | SD | 0.171 | 0.163 | 0.161 | 0.157 | 0.085 | 0.088 | 0.094 | 0.085 |
| 3 | RCT+RWD | SE | 0.177 | 0.198 | 0.170 | 0.167 | 0.090 | 0.092 | 0.091 | 0.088 |
| 3 | RCT+RWD | CP(95%) | 0.965 | 0.961 | 0.963 | 0.963 | 0.955 | 0.965 | 0.937 | 0.935 |
| 4 | RCT | Bias | 0.067 | 0.060 | 0.070 | 0.069 | -0.066 | -0.050 | -0.077 | -0.080 |
| 4 | RCT | SD | 0.197 | 0.231 | 0.202 | 0.226 | 0.212 | 0.208 | 0.228 | 0.224 |
| 4 | RCT | SE | 0.223 | 0.235 | 0.233 | 0.240 | 0.232 | 0.232 | 0.231 | 0.223 |
| 4 | RCT | CP(95%) | 0.967 | 0.947 | 0.971 | 0.967 | 0.965 | 0.963 | 0.941 | 0.940 |
| 4 | RCT+RWD | Bias | 0.035 | 0.032 | 0.030 | -0.017 | 0.009 | 0.009 | 0.005 | 0.011 |
| 4 | RCT+RWD | SD | 0.182 | 0.209 | 0.204 | 0.199 | 0.104 | 0.109 | 0.108 | 0.103 |
| 4 | RCT+RWD | SE | 0.237 | 0.271 | 0.217 | 0.206 | 0.115 | 0.116 | 0.117 | 0.116 |
| 4 | RCT+RWD | CP(95%) | 0.974 | 0.976 | 0.943 | 0.958 | 0.960 | 0.943 | 0.949 | 0.958 |

Cases 1-4 represent (nuc, 20%CR), (nuc, 40%CR), (uc, 20%CR), and (uc, 40%CR), where nuc means there is no unmeasured confounding and uc means there is unmeasured confounding. The SEs are marked in bold to ease comparison between RCT and RCT+RWD.

5 Application

Lung cancer has become the leading cause of cancer-related deaths across the globe, with increasing incidence over the last two decades (Sung et al. 2021). Surgical resection, including lobectomy and sublobar resection, is commonly used for early-stage lung cancer. Lobectomy involves the complete removal of the lung lobe where the tumor is located, while sublobar resection entails the removal of only a smaller section of the affected lobe. In 1995, Ginsberg and Rubinstein reported a randomized trial that compared lobectomy with sublobar resection in patients with clinical T1N0 non-small-cell lung cancer (NSCLC) (Ginsberg and Rubinstein 1995). They found that, compared with lobectomy, sublobar resection does not confer improved perioperative morbidity, mortality, or late postoperative pulmonary function. These results made lobectomy the standard of surgical treatment for patients with clinical T1N0 NSCLC. Sublobar resection for early-stage lung cancer has been reserved for patients with poor pulmonary reserve or other major comorbidities contraindicating lobectomy. Over the years, however, advances in imaging and staging methods have allowed the detection of smaller and earlier tumors, leading to a renewed interest in sublobar resection for patients with clinical stage IA NSCLC who might otherwise undergo a lobectomy (Saji et al. 2022).
Fig. 1
The estimated covariate effects in the HTE. Here \(hist\_ade\) indicates the presence of the adenocarcinoma histologic type, and \(hist\_squ\) indicates the presence of the squamous-cell carcinoma histologic type
Fig. 2
The estimated covariate effects in the unmeasured confounding. Here \(hist\_ade\) indicates the presence of the adenocarcinoma histologic type, and \(hist\_squ\) indicates the presence of the squamous-cell carcinoma histologic type
C140503 is a multicenter, noninferiority, phase 3 trial in which NSCLC patients with tumor size \(\le\)2 cm were randomly assigned to undergo sublobar resection or lobar resection after intraoperative confirmation of node-negative disease (Altorki et al. 2023). From June 2007 to March 2017, a total of 697 patients were assigned to undergo sublobar resection (340 patients) or lobar resection (357 patients). For disease-free survival, the right-censoring rate is 59.7% in the sublobar resection group and 60.5% in the lobar resection group. The trial concluded that sublobar resection was non-inferior to lobar resection with respect to disease-free survival. In addition, a post hoc analysis of the heterogeneity of treatment effects for disease-free survival across patient subgroups, based on the Cox proportional hazards model, revealed that age and tumor size tended to have a negative effect and a positive effect on lobar resection, respectively. NCDB is a clinical oncology database maintained by the American College of Surgeons, and it accounts for 72% of all newly diagnosed lung cancer cases in the United States. The NCDB analysis based on the multivariate Cox proportional hazards model and propensity score-based methods reveals a significant advantage of lobectomy over limited resection, which contradicts the results of C140503. This contradictory result may be attributed to unobserved hidden confounders in the NCDB database. It has been well documented that surgeons and patients tend to opt for limited resection over lobectomy when the patient’s health status is poor, functional respiratory reserve is low, and there is a high burden of comorbidities (Zhang et al. 2019; Lee and Altorki 2023). Unfortunately, these hidden confounders were not captured in the NCDB database, which could result in biased estimates of treatment effects.
Although NCDB provides abundant samples, it fails to give a valid estimate of the causal effect because of unmeasured confounding. We apply the proposed method to integrate the NCDB data with C140503 and examine whether the efficiency of the HTE estimate can be improved. We randomly selected a cohort of 3000 patients with stage 1A NSCLC from the NCDB database, ensuring that their tumor size was \(\le\)2 cm and that they met all the eligibility criteria for C140503. We consider the covariates that appear in both C140503 and NCDB, including race (white and other), sex (male and female), age, tumor size, and histologic type (squamous-cell carcinoma, adenocarcinoma and other). The estimated covariate effects in the HTE and in the unmeasured confounding are presented in Figs. 1 and 2, respectively. Figure 2 shows that an unmeasured confounding effect exists, which is consistent with the previous findings. It also reveals that the hidden confounding is significantly related to the patient’s age at the 90% confidence level. In Fig. 1, it can be observed that, compared with the C140503-only method, the proposed integrative estimator yields shorter confidence intervals. In particular, the estimated effects of sex (\(\alpha _{sex}\)) and of the adenocarcinoma histologic type (\(\alpha _{ade}\)) are shrunk to zero when integrating NCDB. Since the upper end of the 90% confidence interval of the C140503-only estimate \(\widehat{\alpha }_{sex}^{rct}\) is close to zero and \(\widehat{\alpha }_{ade}^{rct}\) is not significant, this shrinkage is reasonable when synthesizing NCDB. These results indicate that integrating the NCDB data with C140503 does improve the efficiency, which demonstrates the practical effectiveness of the proposed method.

6 Conclusion

In this paper, we have developed an integrative method that gives an improved estimate of the HTE by synthesizing the evidence from RCTs and RWD, particularly in situations where the outcome of interest is subject to censoring and the number of covariates is diverging. The situations we consider are more complex and realistic, and therefore more challenging. The proposed method can deal with cases where unmeasured confounding is present in RWD. It can identify whether the unmeasured confounding effect exists in a fully data-driven manner, contributing to more efficient estimates and a deeper understanding of the data generation mechanism. We have rigorously established the theoretical properties, showing that the proposed integrative method yields an HTE estimate that is at least as efficient as the one based on RCT data alone. The proposed method is practically applicable. Based on the evidence from C140503, the randomized controlled trial, and the NCDB database, the real-world data that might be subject to hidden confounding, we have applied the proposed method to improve the estimate of the HTE on survival for patients with clinical T1N0 NSCLC undergoing lobar resection. The results show that integrating NCDB data into C140503 enhances the HTE estimation, indicating the practicality of the proposed method.
In this project, we focus on developing data integration methods utilizing Stute weights, given their widespread use and suitability under the censoring assumptions. However, we acknowledge the potential benefits of exploring more general doubly robust weighting approaches. Future work could extend our methods to incorporate IPCW and doubly robust techniques, potentially building on the frameworks established by Lee et al. (2022) and Lee et al. (2024), initially designed for trial generalization. Moreover, in the context of integrating RWD into RCTs, there are still many problems to be solved. For example, it is common to see that RCTs and RWD have different covariates. Merely taking into account the shared covariates may incur other problems. For instance, some critical covariates to describe heterogeneity in treatment may be excluded. Thus, it is important to develop an integrative approach that can deal with non-uniform covariates in RCTs and RWD. Moreover, RCTs with time-varying treatments are common. Integrative analysis of the continuous-time structural failure time model (Yang et al. 2020a) combining the complementary features of RCT and RWD will be an important topic for future research.

Acknowledgements

We are grateful to the referees for their insightful comments and constructive suggestions, which have significantly enhanced the quality of the paper. This research is supported by National Institute on Aging (1R01AG066883), Directorate for Social, Behavioral and Economic Sciences (SES 2242776), and National Natural Science Foundation of China (No. 12271459).

Declarations

Conflict of interest

The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://​creativecommons.​org/​licenses/​by/​4.​0/​.

References
Altorki N, Wang X, Kozono D, Watt C et al (2023) Lobar or sublobar resection for peripheral stage IA non-small-cell lung cancer. N Engl J Med 388:489–498
Angrist J, Imbens G, Rubin D (1996) Identification of causal effects using instrumental variables. J Am Stat Assoc 91:468–472
Cheng Y, Wu L, Yang S (2023) Enhancing treatment effect estimation: a model robust approach integrating randomized experiments and external controls using the double penalty integration estimator. 39th Conference on Uncertainty in Artificial Intelligence (UAI 2023), accepted
Chernozhukov V, Chetverikov D, Demirer M, Duflo E et al (2018) Double/debiased machine learning for treatment and structural parameters. Econ J 21:1–68
Collins F, Varmus H (2015) A new initiative on precision medicine. N Engl J Med 372:793–795
Efron B, Tibshirani R (1993) An introduction to the bootstrap. Chapman and Hall, New York
Fan J, Peng H (2004) Nonconcave penalized likelihood with a diverging number of parameters. Ann Stat 32:928–961
Ginsberg R, Rubinstein L (1995) Randomized trial of lobectomy versus limited resection for T1 N0 non-small cell lung cancer. Lung Cancer Study Group. Ann Thorac Surg 60:615–623
Guo W, Zhou X, Ma S (2021) Estimation of optimal individualized treatment rules using a covariate-specific treatment effect curve with high-dimensional covariates. J Am Stat Assoc 116:309–321
Hamburg MA, Collins FS (2010) The path to personalized medicine. N Engl J Med 363:301–304
He Q, Zhang HH, Avery CL, Lin D (2016) Sparse meta-analysis with high-dimensional data. Biostatistics 17:205–220
Henderson N, Louis T, Rosner G, Varadhan R (2020) Individualized treatment effects with censored data via fully nonparametric Bayesian accelerated failure time models. Biostatistics 21:50–68
Hu L, Ji J, Li F (2021) Estimating heterogeneous survival treatment effect in observational data using machine learning. Stat Med 40:4691–4713
Huang J, Ma S, Xie H (2006) Regularized estimation in the accelerated failure time model with high-dimensional covariates. Biometrics 62:813–820
Kallus N, Puli A, Shalit U (2018) Removing hidden confounding by experimental grounding. Adv Neural Inf Process Syst 31:10911–10920
Kuroki M, Pearl J (2014) Measurement bias and effect restoration in causal inference. Biometrika 101:423–437
Kunzel S, Sekhon J, Bickel P et al (2019) Metalearners for estimating heterogeneous treatment effects using machine learning. Proc Natl Acad Sci USA 116:4156–4165
Lee B, Altorki N (2023) Sub-lobar resection: the new standard of care for early-stage lung cancer. Cancers 15:2914
Lee D, Yang S, Wang X (2022) Doubly robust estimators for generalizing treatment effects on survival outcomes from randomized controlled trials to a target population. J Causal Inference 10:415–440
Ma Y, Zhou X (2018) Treatment selection in a randomized clinical trial via covariate-specific treatment effect curves. Stat Methods Med Res 26:124–141
Neyman J (1959) Optimal asymptotic tests of composite statistical hypotheses. Probab Stat 96:213–234
Prentice RL, Chlebowski RT, Stefanick ML et al (2008) Estrogen plus progestin therapy and breast cancer in recently postmenopausal women. Am J Epidemiol 167:1207–1216
Powers S, Qian J, Jung K et al (2017) Some methods for heterogeneous treatment effect estimation in high dimensions. Stat Med 37:1767–1787
Robins JM, Rotnitzky A, Scharfstein DO (1999) Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In: Statistical models in epidemiology, the environment, and clinical trials. Springer, New York, pp 1–94
Saji H, Okada M, Tsuboi M, Nakajima R et al (2022) Segmentectomy versus lobectomy in small-sized peripheral non-small-cell lung cancer (JCOG0802/WJOG4607L): a multicentre, open-label, phase 3, randomised, controlled, non-inferiority trial. Lancet 399:1607–1617
Shalit U, Johansson F, Sontag D (2017) Estimating individual treatment effect: generalization bounds and algorithms. Proc Int Conf Mach Learn 70:3076–3085
Simoneau G, Moodie E, Nijjar J, Platt R (2020) Estimating optimal dynamic treatment regimes with survival outcomes. J Am Stat Assoc 115(531):1531–1539
Stute W (1993) Consistent estimation under random censorship when covariates are present. J Multivar Anal 45:89–103
Stute W (1996) Distributional convergence under random censorship when covariables are present. Scand J Stat 23:461–471
Sung H, Ferlay J, Siegel R et al (2021) Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA-A Cancer J Clin 71:209–249
Verde PE, Ohmann C (2015) Combining randomized and non-randomized evidence in clinical research: a review of methods and applications. Res Synth Methods 6:45–62
Wager S, Athey S (2018) Estimation and inference of heterogeneous treatment effects using random forests. J Am Stat Assoc 113(523):1228–1242
Wang L, Kim Y, Li R (2013) Calibrating nonconvex penalized regression in ultra-high dimension. Ann Stat 41(5):2505–2536
Wendling T, Jung K, Callahan A et al (2017) Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Stat Med 37:3309–3324
Wu L, Yang S (2022) Integrative R-learner of heterogeneous treatment effects combining experimental and observational studies. Proc Mach Learn Res 140:1-S5
Yang S, Pieper K, Cools F (2020a) Semiparametric estimation of structural failure time models in continuous-time processes. Biometrika 107:123–136
Yang S, Zeng D, Wang X (2020b) Improved inference for heterogeneous treatment effects using real-world data subject to hidden confounding. arXiv:2007.12922
Yang S, Zeng D, Wang X (2023) Elastic integrative analysis of randomised trial and real-world data for treatment heterogeneity estimation. J Roy Stat Soc B 85:575–596
Zhang Z, Feng H, Zhao H, Hu J et al (2019) Sublobar resection is associated with better perioperative outcomes in elderly patients with clinical stage I non-small cell lung cancer: a multicenter retrospective cohort study. J Thorac Dis 11:1838–1848
Zhou N, Zhu J (2010) Group variable selection via a hierarchical lasso and its oracle property. Stat Interface 3:557–574
Zhou N, Zhu L (2021) On IPW-based estimation of conditional average treatment effects. J Stat Plan Inference 215:1–22
Zhu J, Gallego B (2020) Targeted estimation of heterogeneous treatment effect in observational survival analysis. J Biomed Inf 107:961
Metadata
Title
Integrative analysis of high-dimensional RCT and RWD subject to censoring and hidden confounding
Authors
Xin Ye
Shu Yang
Xiaofei Wang
Yanyan Liu
Publication date
29 April 2025
Publisher
Springer US
Published in
Lifetime Data Analysis
Print ISSN: 1380-7870
Electronic ISSN: 1572-9249
DOI
https://doi.org/10.1007/s10985-025-09654-1
