Publicly Available Published by De Gruyter January 28, 2015

A Semi-stationary Copula Model Approach for Bivariate Survival Data with Interval Sampling

  • Hong Zhu and Mei-Cheng Wang

Abstract

In disease registries, bivariate survival data are typically collected under interval sampling, in which entry into the registry occurs at the time of the first failure event (e.g., HIV infection) within a calendar time window. For all the cases in the registry, the time of the initiating event (e.g., birth) is retrospectively identified, and subsequently the second failure event (e.g., death) is observed during follow-up. In this paper we discuss how interval sampling introduces bias into the data. Given the sampling design that the first event occurs within a specific time interval, the first failure time is doubly truncated, and the second failure time is possibly informatively right censored. We consider the semi-stationary condition that disease progression is independent of when the initiating event occurs. Under this condition, this paper adopts copula models to assess the association between the bivariate survival times with interval sampling. We first obtain bias-corrected estimators of the marginal survival functions and estimate the association parameter of the copula model by a two-stage procedure. In the second part of the work, covariates are incorporated into the survival distributions via proportional hazards models. Inference for the association measure in the copula model is established, where the association is allowed to depend on covariates. Asymptotic properties of the proposed estimators are established, and finite sample performance is evaluated by simulation studies. The method is applied to a community-based AIDS study in Rakai to investigate the dependence between age at infection and residual lifetime, without and with adjustment for HIV subtype.

1 Introduction

In disease surveillance systems or registries, it is common to collect data with a certain failure event, such as diagnosis of disease, occurring within a calendar time interval and then to obtain additional information retrospectively and/or prospectively. Such type of sampling is referred to as the interval sampling [1] and we consider bivariate survival data with interval sampling in this paper. One example of such data is AIDS blood transfusion data collected by the Centers for Disease Control, which is from a registry database, a common source of medical data [2]. Individuals who were diagnosed with AIDS during the course of the registry, July 1st, 1982, to June 30th, 1989, were recruited into the database and followed to study the progression of disease. In this example, the time of the initiating event of the HIV infection was retrospectively identified, and bivariate survival times of interest are the lag time from the infection to the AIDS diagnosis and the survival time after AIDS [2]. Generally speaking, under the interval sampling scheme, subjects experiencing the first failure event within a calendar time interval are identified as cases and entered into a registry [1]. For all the cases, the time of the initiating event is retrospectively confirmed and the occurrence of the second failure event is subsequently observed during the follow-up. Therefore, there is clearly a sampling bias due to the selection process, and subjects with the first failure events occurring before or after the course of the registry are unobservable and unaccountable. Any estimation and inference procedure done without consideration of this fact could possibly yield biased results. In the literature, sampling bias issues in disease surveillance data related to AIDS have been extensively studied; see the article of Brookmeyer [3] for an overview of research papers in the field. 
Methods were developed to handle, for example, various types of truncation, censoring and length-biased sampling.

This paper is partly motivated by a Rakai HIV study in investigating the relationship between age at infection and survival time of treatment-naive HIV-infected individuals. This study in the rural Rakai district of southwestern Uganda conducted annual surveillance from November 1994 in an open cohort of individuals aged 15–49 years [4]. Interest is focused on a cohort of HIV seroconverters, who were initially HIV negative, then seroconverted between 1995 and 2003 and followed until they died or were censored by out-migration or the end of follow-up. With a wide range of age at infection, the risk of death may be positively or negatively associated with increasing age at infection. The scientific goal is to explore how the HIV progression differs by the age at infection and HIV subtype. Since antiretroviral treatment (ART) became available in 2004 in the Rakai Health Sciences Program, the follow-up time and survival analysis were truncated on Dec 31st, 2003, to assess the survival of ART-naive HIV-infected individuals [4]. In this study, the initiating, first and second failure events correspond, respectively, to the birth, incidence of HIV infection and death. Bivariate survival times refer to the age at HIV infection and residual lifetime. Figure 1 provides a graphical presentation to illustrate how the interval sampling design arises in the Rakai HIV study.

Figure 1 An explanatory plot of data for a cohort of the Rakai HIV patients infected between 1995 (study initiation year) and 2003 (truncation year), T is the birth time, Y is the time from birth to HIV infection and Z is the time from HIV infection to death

As shown in Figure 1, the sampling population consists of individuals who became HIV infected between 1995 and 2003. Under the interval sampling, the age at HIV infection was observed subject to double truncation, that is, left truncation on Jan 1st, 1995, and right truncation on Dec 31st, 2003. The residual lifetime was possibly dependently right censored. A previous study [4] suggested that survival time decreased significantly with older age at infection. Their conclusion was based on comparing Kaplan–Meier survival curves among different groups of age at infection by the log-rank test and the estimated hazard ratio of death associated with age at infection by the Cox proportional hazards model. However, this ignored the fact that data were collected with the interval sampling, which may introduce substantial bias in data analysis. Therefore, our first research problem is to correctly examine the association between age at infection and residual lifetime by removing the bias from the interval sampling. Second, the HIV subtype was found to have an impact on the progression of HIV infection [5]. Thus, we further study the effect of HIV subtype on the association between age at infection and residual lifetime and evaluate the association adjusting for the HIV subtype.

In statistical literature, bivariate and multivariate survival data have been extensively studied. Various statistical methods have been developed to nonparametrically analyze bivariate survival data with right censoring [6, 7, 8, 9]. When the association of bivariate survival times is of interest, the semiparametric copula model has become an increasingly popular tool for modeling the dependence. In particular, copula-based survival models have been proposed by Shih and Louis [10] for bivariate data with both failure times subject to right censoring, by Wang [11] for bivariate survival data under dependent censoring and by Lakhal-Chaieb et al. [12] for bivariate serial gap times. It is noted that these papers consider bivariate survival data where both failure times are subject to right censoring. They are different from the bivariate survival data with interval sampling of our interest, for which the first failure time is subject to double truncation and the second failure time is subject to right censoring. Nevertheless, the copula family includes many useful bivariate survival models and enjoys flexibility in modeling [10]. An appealing feature is that it allows separate modeling and estimation of the margins and the dependency parameter. Estimation and inference can be carried out by a two-stage procedure. At the first stage, the marginal survival functions of each failure time are consistently estimated. At the second stage, the association parameter is estimated by maximizing a pseudo-likelihood with the marginal survival functions replaced by their consistent estimators. The idea of two-stage estimation for the copula model has been used by Genest et al. [13] for complete data, Shih and Louis [10] for right-censored data and Wang and Ding [14] for bivariate current status data. The properties of the proposed association estimators depend on regularity conditions on the imposed copula model and the plugged-in estimators of the marginal survival functions. 
In these works, the covariate effect is often modeled through marginal distributions, such as the marginal Cox regression model, assuming the association parameter in the copula model is constant.

In this paper, we consider the semi-stationary copula model for bivariate survival data with interval sampling, and the association parameter is estimated through a two-stage procedure based on a pseudo-conditional likelihood. One challenge is how to correctly estimate the marginal distributions using interval sampling data, which is crucial to ensure unbiased estimation of the association between bivariate survival times. Under the stationary condition, the model was studied in Zhu and Wang [1], where it was shown that interval sampling does not induce bias on each univariate failure time. This paper relaxes the stationarity assumption to semi-stationarity, which is a much less restrictive condition, and investigates the data structure in which the first failure time is doubly truncated and the second failure time is dependently right censored. Moreover, motivated by the Rakai HIV study where the HIV progression is likely to depend on the HIV subtype, it is of interest to quantify the covariate effect on the association. Therefore, we focus on two scenarios. First, in the absence of covariates, we propose bias-corrected estimators for the marginal survival functions and estimate the association parameter of the copula model by a two-stage procedure. Then we model the marginal distributions in a more flexible manner by Cox proportional hazards models with covariates incorporated. The association is modeled in a parametric way to include covariates and we study the covariate-adjusted association. A novelty of the second model lies in the explicit linkage of the covariate effect to the association measure. Our approaches are not restricted to a particular copula but include the Clayton, positive stable and Frank copula models. The rest of the paper is organized as follows. In Section 2, the interval sampling design is discussed in more detail, and the copula model for bivariate survival data as well as the semi-stationary model assumption are introduced. 
In Section 3, marginal survival distribution for each failure time and the association parameter in the copula are studied under the semi-stationary condition, without consideration of covariates. In Section 4, we incorporate covariates into the copula model and evaluate the association. Asymptotic properties of the proposed estimators are established. Finite sample performances are examined by simulation studies in Section 5. In Section 6, for illustration, the proposed method is applied to the Rakai HIV study. Finally, concluding remarks and discussion are included in Section 7. Technical details and proofs are provided in supplementary materials.

2 The semi-stationary copula model

In this section, we describe the data structure for bivariate survival data with interval sampling and some fundamental concepts of the copula model, together with the semi-stationary model assumption. Statistical methods and inference are developed for a target population of individuals experiencing the first failure event of interest. To begin, we define random variables for the target population. Let $T$ denote the calendar time of the initiating event, $Y$ denote the time from the initiating event to the first failure event and $Z$ denote the time from the first event to the second event. The failure times $Y$ and $Z$ are possibly correlated and their dependent relationship is of primary interest. Under the interval sampling, the sampling population is made up of subjects whose first failure events occur within a calendar time interval $[0, t_0]$, described by the constraint $0 \le Y + T \le t_0$. Therefore, bivariate failure times are observed subject to sampling bias. Specifically, $Y$ is doubly truncated. Denote the double truncation rate by $\mu = 1 - \mathrm{pr}(-T \le Y \le t_0 - T)$; with this probability a person who experienced the first failure event will not be identified. Let $C$ denote the censoring time measured from the first event, because censoring for the second event can only arise following the occurrence of the first event. Correspondingly, the observation of $Z$ is censored by $C$.
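The sampling scheme described in this section can be mimicked in a small simulation. The following is a minimal sketch, not the authors' code: the distributions chosen for T, Y and Z, the window length and the function name `simulate_interval_sample` are all illustrative assumptions. It keeps only subjects whose first event falls in the window [0, t0] and censors the second failure time at the end of the window.

```python
import random

def simulate_interval_sample(n_target, t0, seed=0):
    """Draw (t, y, x, delta) tuples under interval sampling.

    Illustrative assumptions (not from the paper): T has a linearly
    increasing density on [-y_max, t0], Y is uniform on [0, y_max],
    and Z is exponential with mean 3, independent of Y.
    """
    rng = random.Random(seed)
    y_max = 10.0
    sample = []
    while len(sample) < n_target:
        # calendar time of the initiating event (non-uniform on purpose)
        t = -y_max + (t0 + y_max) * rng.random() ** 0.5
        y = rng.uniform(0.0, y_max)      # time to the first failure event
        z = rng.expovariate(1.0 / 3.0)   # time from first to second event
        m = t + y                        # calendar time of the first event
        if not (0.0 <= m <= t0):
            continue                     # doubly truncated: never registered
        c = t0 - m                       # censoring by the end of follow-up
        x = min(z, c)
        delta = 1 if z <= c else 0
        sample.append((t, y, x, delta))
    return sample

data = simulate_interval_sample(500, t0=5.0)
```

Every retained record satisfies the window constraint by construction, while subjects whose first event falls outside the window leave no trace in `data`, which is the source of the sampling bias discussed in the text.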

Assume that the initiating event $T$ occurs over the calendar time with a rate function $\lambda(t)$ for $t \le t_0$. Let $f_{Y,Z}(y,z)$ denote the population joint density function of $(Y,Z)$, and $F_Y(\cdot)$ and $F_Z(\cdot)$ denote the population marginal cumulative distribution functions of $Y$ and $Z$, respectively. We set $y_- = \inf\{y: F_Y(y) > 0\}$, $y_+ = \sup\{y: F_Y(y) < 1\}$, $z_- = \inf\{z: F_Z(z) > 0\}$, $z_+ = \sup\{z: F_Z(z) < 1\}$, $t_- = \inf\{t: \lambda(t) > 0\}$, and assume that the failure time $Y$ has finite support with $y_+ < \infty$ to reduce mathematical complexity in the discussion. The population density function of $T$, $g(t)$, can be defined as a normalized rate function on the interval $[-y_+, t_0 - y_-]$ as $g(t) = \lambda(t) I(-y_+ \le t \le t_0 - y_-) / \int_{-y_+}^{t_0 - y_-} \lambda(u)\,du$, and its population cumulative distribution function is denoted by $G(t)$.

Suppose the bivariate failure times $(Y,Z)$ come from the copula $C_\alpha$ for some association parameter $\alpha \in \mathbb{R}$, where $C_\alpha$ is a distribution function on $[0,1]^2$ with density $c_\alpha$. Then the joint survival function and density function of $(Y,Z)$ are given by

$$S_{Y,Z}(y,z) = C\{S_Y(y), S_Z(z), \alpha\}, \quad y, z \ge 0,$$
$$f_{Y,Z}(y,z) = c\{S_Y(y), S_Z(z), \alpha\}\, f_Y(y)\, f_Z(z), \quad y, z \ge 0,$$

where SY(y), SZ(z), fY(y) and fZ(z) are the population marginal survival functions and densities of Y and Z, respectively. The association parameter α is closely related to Kendall’s τ, the rank correlation coefficient, expressed as

$$\tau = 4\int_0^1\int_0^1 C(u,v,\alpha)\,dC(u,v,\alpha) - 1.$$
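As a numerical check of this relationship, the sketch below evaluates the double integral by a midpoint rule for one specific family. It assumes the Clayton copula in the parameterization C(u,v,α) = (u^(−α) + v^(−α) − 1)^(−1/α) with α > 0, which may differ from the parameterization used in the paper; for this form the known closed-form value is τ = α/(α + 2). The function names and the grid size are illustrative choices.

```python
def clayton_cdf(u, v, alpha):
    """Clayton copula C(u, v, alpha) in the assumed parameterization."""
    return (u ** -alpha + v ** -alpha - 1.0) ** (-1.0 / alpha)

def clayton_density(u, v, alpha):
    """Density c(u, v, alpha), the mixed second derivative of C."""
    s = u ** -alpha + v ** -alpha - 1.0
    return (1.0 + alpha) * (u * v) ** (-alpha - 1.0) * s ** (-1.0 / alpha - 2.0)

def kendall_tau_numeric(alpha, grid=200):
    """Midpoint-rule approximation of 4 * (double integral of C dC) - 1."""
    h = 1.0 / grid
    total = 0.0
    for i in range(grid):
        u = (i + 0.5) * h
        for j in range(grid):
            v = (j + 0.5) * h
            # dC(u, v) = c(u, v) du dv
            total += clayton_cdf(u, v, alpha) * clayton_density(u, v, alpha)
    return 4.0 * total * h * h - 1.0

alpha = 2.0
tau_numeric = kendall_tau_numeric(alpha)
tau_closed = alpha / (alpha + 2.0)   # closed form for this Clayton form
```

The two values should agree up to discretization error, which illustrates why a plug-in estimate of τ follows directly once α has been estimated.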

Assume that $(T_1,Y_1,Z_1), \ldots, (T_n,Y_n,Z_n)$ are independent and identically distributed (i.i.d.). We then introduce the following model assumption to facilitate the development of the proposed work.

Assumption (S). The bivariate failure times $(Y,Z)$ are independent of the time of the initiating event $T$.

The model is considered to be semi-stationary if (S) is satisfied. It is noticed that the validity of the semi-stationarity assumption (S) may be questionable in the presence of systematic shifts in factors related to disease progression, such as availability of new treatments in the context of HIV infection. Nevertheless, the methods in this paper are proposed under the semi-stationary assumption for analyzing bivariate survival data with interval sampling. Actually, this assumption may be satisfied in some biomedical or epidemiological studies where no new treatment or diagnostic tool would have a significant impact on disease incidence or survival during a specific study period. In addition to the semi-stationarity, if the initiating events occur at a constant rate over the calendar time so that T follows a uniform distribution, the model is considered to be stationary as discussed in Zhu and Wang [1]. The stationary model forms a special case in which the bivariate survival data with interval sampling are unbiased. To be specific, under the stationarity, double truncation from the interval sampling does not result in bias on the first failure time Y, and the second failure time Z is independently right censored. In this paper, we focus on the model under the semi-stationary condition, which is less restrictive than the stationary condition, to address the issue of interval sampling bias.

Assuming that (S) holds, the joint density of the observed $(t,y)$ can be written as

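$$p_{T,Y}(t,y) = \frac{g(t) f_Y(y) I(-y \le t \le t_0 - y)}{\mathrm{pr}(-T \le Y \le t_0 - T)} = \left[\frac{g(t) I(-y \le t \le t_0 - y)}{G(t_0 - y) - G(-y)}\right] \times \left[\frac{\{G(t_0 - y) - G(-y)\} f_Y(y)}{\int_0^\infty \{G(t_0 - u) - G(-u)\} f_Y(u)\,du}\right] = p_{T|Y}(t|y)\, p_Y(y), \qquad (1)$$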

where pT,Y(t,y) is the sampling joint density of (T,Y), pT|Y(t|y) refers to the first bracket of the above formula and pY(y) refers to the second bracket and is the marginal density of Y.

3 The copula model without covariates

Under the semi-stationary condition when (S) is satisfied, we fit a copula model for bivariate survival data with interval sampling, without considering covariates in this section. The data structure is studied and bias-corrected consistent estimators for marginal survival functions are proposed. In some situations, there is sufficient information on the distribution of the initiating event time $T$ to determine a well-fitted parametric form. In such cases, it is desirable to make use of this information and incorporate it into the analysis. Therefore, we assume a parametric density function $g(t;\theta)$ to model $T$ with corresponding distribution function $G(t;\theta)$, where $\theta \in \Theta$ and $\Theta$ is an open set in $\mathbb{R}^k$. Taking the Rakai HIV study data as an example, $g$ describes the birth trend for HIV seroconverters. In the following, under the semi-stationary condition that $T$ is independent of $(Y,Z)$, a conditional likelihood estimator of $\theta$ is obtained, and the inverse probability weighting method is employed to derive consistent estimators of the marginal survival functions, $S_Y(y)$ and $S_Z(z)$, at the first stage. At the second stage, the association parameter $\alpha$ in the copula model is estimated based on a pseudo-conditional likelihood. The approach is general and can be applied to different classes of the copula model.

Under Assumption (S), using eq. (1), the conditional likelihood function of observed {t} given observed {y} can be derived as

$$L_c(\theta) = \prod_{i=1}^n p_{T|Y}(t_i|y_i,\theta) = \prod_{i=1}^n \frac{g(t_i;\theta)}{G(t_0 - y_i;\theta) - G(-y_i;\theta)},$$

in which the distribution of $Y$ becomes a nuisance parameter and is eliminated by the conditioning procedure. The conditional maximum likelihood estimator $\hat\theta$ is obtained by maximizing $L_c(\theta)$. The large sample properties of $\hat\theta$ can be obtained using techniques for M-estimators. Under regularity conditions, as $n \to \infty$, $\hat\theta$ is consistent and $n^{1/2}(\hat\theta - \theta)$ converges weakly to a mean zero multivariate normal distribution with variance–covariance matrix $I_c^{-1}$, where

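$$I_c = E\left[\left\{\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i|Y_i)\right\}\left\{\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i|Y_i)\right\}^{T}\right]$$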

is the Fisher information matrix of Lc(θ).
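To make the conditional likelihood step concrete, here is a minimal sketch under stated assumptions. The one-parameter family for T (density proportional to exp(θt) on a fixed interval), the support endpoints, the uniform distribution for Y in the simulation and all function names are hypothetical choices for illustration; the paper does not prescribe them. The code maximizes the sum over subjects of log g(t_i; θ) minus the log of G(t0 − y_i; θ) − G(−y_i; θ) by a golden-section search.

```python
import math
import random

T0 = 5.0           # sampling window is [0, T0]
A, B = -10.0, T0   # assumed support of T (hypothetical)

def G(t, theta):
    """CDF of the assumed density g(t; theta), proportional to exp(theta*t) on [A, B]."""
    t = min(max(t, A), B)
    if abs(theta) < 1e-12:
        return (t - A) / (B - A)
    return math.expm1(theta * (t - A)) / math.expm1(theta * (B - A))

def log_g(t, theta):
    """Log of g(t; theta) for t in [A, B]."""
    if abs(theta) < 1e-12:
        return -math.log(B - A)
    return math.log(theta / math.expm1(theta * (B - A))) + theta * (t - A)

def cond_loglik(theta, t_obs, y_obs):
    """log L_c(theta) = sum_i [log g(t_i) - log{G(T0 - y_i) - G(-y_i)}]."""
    return sum(log_g(t, theta) - math.log(G(T0 - y, theta) - G(-y, theta))
               for t, y in zip(t_obs, y_obs))

def maximize(f, lo, hi, iters=80):
    """Golden-section search for the maximizer of a unimodal function."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - r * (b - a), a + r * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:                      # maximizer lies in [c, b]
            a, c, fc = c, d, fd
            d = a + r * (b - a)
            fd = f(d)
        else:                            # maximizer lies in [a, d]
            b, d, fd = d, c, fc
            c = b - r * (b - a)
            fc = f(c)
    return (a + b) / 2.0

# Simulate interval-sampled (t, y) pairs with a known theta.
rng = random.Random(1)
theta_true = 0.3
t_obs, y_obs = [], []
while len(t_obs) < 1500:
    u = rng.random()
    t = A + math.log1p(u * math.expm1(theta_true * (B - A))) / theta_true
    y = rng.uniform(0.0, 10.0)           # assumed distribution of Y
    if 0.0 <= t + y <= T0:               # first event inside the window
        t_obs.append(t)
        y_obs.append(y)

theta_hat = maximize(lambda th: cond_loglik(th, t_obs, y_obs), -1.0, 1.5)
```

Note that the distribution of Y never enters `cond_loglik`, which mirrors the remark in the text that conditioning eliminates it as a nuisance parameter.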

We then explore the probability structure of the bivariate data to estimate marginal survival functions SY(y) and SZ(z). Due to the interval sampling, the sampling distributions of Y and Z are, in general, different from their population distributions. For the first failure time Y, as shown in formula (1), the sampling density pY(y) is proportional to its population density fY(y) as

$$p_Y(y) = \frac{w(y,\theta) f_Y(y)}{\int_0^\infty w(u,\theta) f_Y(u)\,du},$$

where $w(y,\theta) = G(t_0 - y,\theta) - G(-y,\theta)$ is called the selection bias function and represents the probability that a subject in the target population with first failure time $y$ will be observed during the calendar time interval. The correction for the bias from interval sampling makes use of this selection bias function. Weighting each observation of $Y$ by a weight that is inversely proportional to the selection bias function at the value of that observation adjusts for the sampling bias. Thus a consistent estimator of $S_Y(y)$ can be derived as

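$$\hat S_Y(y,\hat\theta) = \frac{\sum_{i=1}^n \{G(t_0 - y_i;\hat\theta) - G(-y_i;\hat\theta)\}^{-1} I(y_i > y)}{\sum_{i=1}^n \{G(t_0 - y_i;\hat\theta) - G(-y_i;\hat\theta)\}^{-1}},$$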

where $y_i$ is the observed first failure time and $\hat\theta$ is the conditional likelihood estimator from $L_c(\theta)$. The weighted empirical survival function $\hat S_Y(y,\hat\theta)$ can be proved to be a semiparametric maximum likelihood estimator of $S_Y(y)$.
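The weighted empirical survival function is simple to compute once the selection probabilities are available. The sketch below assumes that the values w(y_i, θ̂) have already been evaluated and are passed in as a list; the function name and the toy numbers are hypothetical.

```python
def weighted_survival(y_obs, w_obs, y):
    """Inverse-probability-weighted empirical survival function at y.

    y_obs: observed first failure times.
    w_obs: selection probabilities w(y_i, theta_hat), assumed precomputed.
    """
    inv = [1.0 / w for w in w_obs]
    num = sum(iw for yi, iw in zip(y_obs, inv) if yi > y)
    return num / sum(inv)

# Toy numbers (hypothetical): the subject with y = 2 was twice as likely
# to be sampled, so it receives half the weight of the other two.
s_weighted = weighted_survival([1.0, 2.0, 3.0], [0.2, 0.4, 0.2], 1.5)
s_unweighted = weighted_survival([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], 1.5)
```

With equal selection probabilities the estimator reduces to the ordinary empirical survival function, which corresponds to the stationary case discussed in Section 2.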

For the second failure time $Z$, we discuss different situations of censoring on $Z$ according to the real data application. In the Rakai HIV study, following the previous work [4], we consider that censoring occurs either due to out-migration or the end of the study. It is noted that there may be dependent censoring caused by informative dropout in some HIV studies. Developing a method to handle this dependent censoring is an interesting problem and will be explored in future work. Let $C_1$ be the censoring time caused by out-migration and measured from the first event, where $C_1 = \infty$ if there is no out-migration. As for a prospective follow-up study, it is reasonable to assume that the censoring time $C_1$ by out-migration is independent of the residual lifetime $Z$. The independent censoring of $C_1$ on $Z$ may be satisfied by the study design and model assumption. Then, we only need to adjust for the interval sampling bias on $Z$. Since $Y$ and $Z$ are observed concurrently, the same selection bias function $w(y,\theta) = G(t_0 - y,\theta) - G(-y,\theta)$ can be used to adjust for the sampling bias of observing $Z$ induced by $Y$, and a weighted Kaplan–Meier estimator of $S_Z(z)$ is developed by the inverse probability weighting method. Let $\{(y,x)\}$, where $x = \min(z,c_1)$, denote the observed bivariate data. The weighted Kaplan–Meier estimator of $S_Z(z)$ is given as

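$$\hat S_Z(z,\hat\theta) = \prod_{x_{(j)} < z}\left[1 - \frac{\sum_{i \in d_{j2}} w(y_i,\hat\theta)^{-1}}{\sum_{i \in R_{j2}} w(y_i,\hat\theta)^{-1}}\right] = \prod_{x_{(j)} < z}\left[1 - \frac{\sum_{i \in d_{j2}} \{G(t_0 - y_i,\hat\theta) - G(-y_i,\hat\theta)\}^{-1}}{\sum_{i \in R_{j2}} \{G(t_0 - y_i,\hat\theta) - G(-y_i,\hat\theta)\}^{-1}}\right],$$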

where $\hat\theta$ is the conditional maximum likelihood estimator, $d_{j2} = \{i: x_i = x_{(j)}\}$ and $R_{j2} = \{i: x_i \ge x_{(j)}\}$ are the failure event set and risk set at $x_{(j)}$, respectively, and $\{x_{(1)}, \ldots, x_{(k)}\}$ are the distinct ordered uncensored second failure times with their counterparts at the first failure time as $\{y_{(1)}, \ldots, y_{(k)}\}$. The inverse probability weighting method has been widely used in the literature, particularly to reduce selection bias in observational studies. In our model setting, the selection bias function is constructed according to the biased distribution of $Y$ under the interval sampling; thus the inverse of this function plays the role of correcting for the induced sampling bias of observing the second failure time $Z$. The weighted Kaplan–Meier estimator $\hat S_Z(z,\hat\theta)$ is a semiparametric consistent estimator of $S_Z(z)$. The weak convergence of $n^{1/2}\{\hat S_Z(\cdot) - S_Z(\cdot)\}$ can be established following the lines in Aalen [15].
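The weighted Kaplan–Meier estimator can be sketched in a few lines. As before, the selection probabilities w(y_i, θ̂) are assumed to be precomputed, and the function name and toy data are hypothetical. The product is taken over distinct uncensored times strictly below z, following the display in the text.

```python
def weighted_km(x_obs, delta, w_obs, z):
    """Weighted Kaplan-Meier estimate of S_Z at z.

    x_obs: observed times min(z_i, c_i); delta: 1 if uncensored, else 0.
    w_obs: selection probabilities w(y_i, theta_hat), assumed precomputed.
    """
    inv = [1.0 / w for w in w_obs]
    event_times = sorted({x for x, d in zip(x_obs, delta) if d == 1})
    surv = 1.0
    for xj in event_times:
        if xj >= z:
            break
        # weighted size of the failure set and of the risk set at xj
        d_j = sum(iw for x, d, iw in zip(x_obs, delta, inv) if d == 1 and x == xj)
        r_j = sum(iw for x, iw in zip(x_obs, inv) if x >= xj)
        surv *= 1.0 - d_j / r_j
    return surv

x = [1.0, 2.0, 3.0, 4.0]
d = [1, 0, 1, 1]
s_equal = weighted_km(x, d, [0.5, 0.5, 0.5, 0.5], 3.5)     # ordinary KM
s_unequal = weighted_km(x, d, [0.5, 0.5, 0.25, 0.5], 3.5)  # third subject upweighted
```

With equal weights the function returns the ordinary Kaplan–Meier value, so any difference between the two calls is due entirely to the bias correction.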

Next, let $C_2$ denote the censoring time measured from the first event and caused by the end of the follow-up at time $t_0$, that is, $T + Y + C_2 = t_0$. Let $M = T + Y$ denote the calendar time of the first event; then the censoring time of observing $Z$ is $t_0 - M$. First of all, we consider the case without selection bias from the interval sampling or other random censoring. When $Z \perp M$, a standard estimate of $S_Z(z)$ is the Kaplan–Meier estimate. However, when $Z$ and $M$ are dependent, this estimator is inconsistent. One exception shown by Zhu and Wang [1] is that under the stationary condition when $T$ follows a uniform distribution, $M$ occurs at a constant rate and is independent of $Z$; therefore, $S_Z(z)$ can still be estimated by the Kaplan–Meier estimate. For correlated $(Z,M)$, we can intentionally treat $M$ as a covariate and fit a working Cox proportional hazards model of $Z$ given $M$ as

$$h_Z(z|m) = h_{0Z}(z)\exp(\mu m),$$

where $h_{0Z}(z)$ is an unspecified baseline hazard function and $\mu$ is the regression coefficient. In the Rakai HIV study, $M$ is the calendar time of HIV infection and $Z$ is the residual lifetime. The “exact method” for estimating the survival function of $Z$ is to estimate the conditional survival distribution of $Z$ given $M$ using the working Cox model; $S_Z(z)$ can then be estimated by the empirical average of the estimated conditional survival distribution of $Z$ given $M$ over all the observed values of the covariate $M$ [16]. Secondly, similar to the previous discussion, we observe $\{(x,\delta)\}$, where $x = \min(z,c_2) = \min(z, t_0 - t - y)$ and $\delta = I(z \le t_0 - t - y)$, with the selection probability $w(y,\theta)$ due to the interval sampling. Therefore, an inverse probability weighting method is applied for the Cox model estimation to adjust for the interval sampling bias. Define a weighted partial score function as

$$U_{j2}(\mu) = m_j - \frac{\sum_{i \in R_{j2}} w(y_i,\hat\theta)^{-1}\exp(\mu m_i)\, m_i}{\sum_{i \in R_{j2}} w(y_i,\hat\theta)^{-1}\exp(\mu m_i)}.$$

Then a weighted estimating equation for $\mu$ is $U_2(\mu) = \sum_j w(y_j,\hat\theta)^{-1} U_{j2}(\mu) = 0$, with the solution denoted by $\hat\mu$, and let $\hat S_{0Z}(z)$ denote the corresponding Breslow estimator for the baseline survival function of $Z$. The conditional survival function of $Z$ given $M$ is consistently estimated by $\hat S_Z(z|m) = \hat S_{0Z}(z)^{\exp(\hat\mu m)}$. Accordingly, a consistent estimator of $S_Z(z)$ is obtained as $\hat S_Z(z,\hat\theta) = \sum_{i=1}^n n^{-1} \hat S_{0Z}(z)^{\exp(\hat\mu m_i)}$. The asymptotic properties of $\hat S_Z(z,\hat\theta)$ depend on the asymptotic properties of $\hat\mu$ and $\hat S_{0Z}(z)$, for which a detailed discussion is provided in Section 4. Following our notation, the common censoring time $C$ can be expressed as $C = \min(C_1,C_2) = \min(C_1, t_0 - M)$, and the method based on the working Cox model would provide an appropriate estimator for $S_Z(z)$ in general.
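The working Cox model step can be sketched as follows for a scalar covariate m. This is an illustration under assumptions rather than the authors' implementation: the inverse selection weights are taken as given, only uncensored subjects are assumed to contribute score terms, the baseline is estimated by a weighted Breslow-type estimator of our own choosing, and the toy data and function names are hypothetical. The score is decreasing in μ, so bisection suffices.

```python
import math

def weighted_score(mu, x, delta, m, inv_w):
    """Weighted partial score for the working model h(z|m) = h0(z) exp(mu*m)."""
    total = 0.0
    for j in range(len(x)):
        if delta[j] == 0:
            continue  # assumption: only uncensored subjects contribute
        risk = [i for i in range(len(x)) if x[i] >= x[j]]
        s0 = sum(inv_w[i] * math.exp(mu * m[i]) for i in risk)
        s1 = sum(inv_w[i] * math.exp(mu * m[i]) * m[i] for i in risk)
        total += inv_w[j] * (m[j] - s1 / s0)
    return total

def solve_mu(x, delta, m, inv_w, lo=-20.0, hi=20.0, iters=100):
    """Bisection for the root of the score, which is decreasing in mu."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weighted_score(mid, x, delta, m, inv_w) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def averaged_survival(z, mu, x, delta, m, inv_w):
    """Average over i of S0(z)^exp(mu*m_i), with a weighted Breslow baseline."""
    cum_haz = 0.0
    for j in range(len(x)):
        if delta[j] == 1 and x[j] <= z:
            s0 = sum(inv_w[i] * math.exp(mu * m[i])
                     for i in range(len(x)) if x[i] >= x[j])
            cum_haz += inv_w[j] / s0
    return sum(math.exp(-cum_haz * math.exp(mu * mi)) for mi in m) / len(m)

# Toy data (hypothetical): times, event indicators, covariate, inverse weights.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
delta = [1, 1, 0, 1, 1, 1]
m = [0.5, -0.2, 0.9, 0.1, -0.7, 0.3]
inv_w = [1.0, 2.0, 1.5, 1.0, 2.5, 1.0]

mu_hat = solve_mu(x, delta, m, inv_w)
s_early = averaged_survival(2.5, mu_hat, x, delta, m, inv_w)
s_late = averaged_survival(5.5, mu_hat, x, delta, m, inv_w)
```

Averaging the conditional survival curves over the observed covariate values is what turns the conditional model for Z given M into a marginal estimate of the survival function of Z.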

We then present the estimation procedure for the association parameter $\alpha$ in the copula model based on a pseudo-conditional likelihood. The bivariate survival distribution of $(Y,Z)$ is modeled by the copula as $S_{Y,Z}(y,z) = C\{S_Y(y), S_Z(z), \alpha\}$. First, we consider the situation when $\theta$ is known, which means the exact parametric distribution of $T$ is available. The marginal survival functions are estimated by $\hat S_Y(y,\theta)$ and $\hat S_Z(z,\theta)$, respectively. For each subject $i$ ($i = 1, \ldots, n$), data $\{y_i, x_i, \delta_i, t_i\}$ are observed, where $x_i = \min(z_i, c_i)$ and $\delta_i = I(z_i \le c_i)$. The joint density function of $(Y,X,\delta,T)$ can be expressed as a product of the conditional density function of $(Y,X,\delta|T)$ and the marginal density function of $T$. The corresponding conditional likelihood function, $L_c(\alpha)$, can be derived as

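$$L_c(\alpha) = \prod_i \frac{f_{Y,Z}(y_i,x_i)^{\delta_i}\{-\partial S_{Y,Z}(y_i,x_i)/\partial y_i\}^{1-\delta_i}}{S_Y(-t_i) - S_Y(t_0 - t_i)} \propto \prod_i f_{Y,Z}(y_i,x_i)^{\delta_i}\left\{-\frac{\partial S_{Y,Z}(y_i,x_i)}{\partial y_i}\right\}^{1-\delta_i}.$$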

An interesting feature in the decomposition of likelihood is that the marginal likelihood function does not involve the parameter of interest α. Therefore, it is appropriate to estimate and make inference on α solely based on the conditional likelihood function Lc(α). Denote {SY(yi),SZ(xi)} by (ui,vi) and by the copula model of bivariate survival data,

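$$L_c(\alpha) \propto \prod_{i=1}^n l(u_i,v_i,\alpha) = \prod_{i=1}^n c(u_i,v_i,\alpha)^{\delta_i}\left\{\frac{\partial C(u_i,v_i,\alpha)}{\partial u_i}\right\}^{1-\delta_i}.$$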

From the previous discussion, the two margins $S_Y(y)$ and $S_Z(z)$ can be consistently estimated by $\hat S_Y(y,\theta)$ and $\hat S_Z(z,\theta)$, respectively. Therefore, a pseudo-conditional likelihood score equation is constructed by substituting $S_Y(y)$ and $S_Z(z)$ with their consistent estimators and is given as

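$$U_\alpha\{\hat S_Y(\theta), \hat S_Z(\theta), \alpha\}^{(p)} = \frac{\partial}{\partial\alpha}\sum_{i=1}^n\left(\delta_i\log\left[c\{\hat S_Y(y_i,\theta), \hat S_Z(x_i,\theta), \alpha\}\right] + (1-\delta_i)\log\left[\frac{\partial C\{\hat S_Y(y_i,\theta), \hat S_Z(x_i,\theta), \alpha\}}{\partial u_i}\right]\right) = 0. \qquad (2)$$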

The estimator of the association parameter $\alpha$, $\hat\alpha(\theta)$, is the solution to eq. (2). Note that $T$ follows a uniform distribution under the stationary condition, where $\hat S_Y(y,\theta)$ and $\hat S_Z(z,\theta)$ reduce to the empirical survival function and the Kaplan–Meier estimator, respectively. As $n \to \infty$, $\hat\alpha(\theta)$ is a consistent estimator of $\alpha_0$, and $n^{1/2}\{\hat\alpha(\theta) - \alpha_0\}$ converges weakly to a normal distribution with mean zero and variance $\sigma^2$.
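The second-stage maximization can be illustrated for a single copula family. The sketch below assumes the Clayton copula in the parameterization C(u,v,α) = (u^(−α) + v^(−α) − 1)^(−1/α), treats the pseudo-observations (u_i, v_i) as already computed from first-stage estimators, and uses simulated uniform margins in their place; the censoring mechanism, sample size and function names are hypothetical. Uncensored pairs contribute log c and censored pairs contribute the log of the partial derivative of C with respect to u.

```python
import math
import random

def log_lik_clayton(alpha, u, v, delta):
    """Sum of delta*log c(u,v) + (1-delta)*log dC(u,v)/du for the Clayton copula."""
    total = 0.0
    for ui, vi, di in zip(u, v, delta):
        log_s = math.log(ui ** -alpha + vi ** -alpha - 1.0)
        if di == 1:
            total += (math.log1p(alpha)
                      - (alpha + 1.0) * (math.log(ui) + math.log(vi))
                      - (1.0 / alpha + 2.0) * log_s)
        else:
            total += -(alpha + 1.0) * math.log(ui) - (1.0 / alpha + 1.0) * log_s
    return total

def maximize(f, lo, hi, iters=60):
    """Golden-section search for the maximizer of a unimodal function."""
    r = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c, d = b - r * (b - a), a + r * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iters):
        if fc < fd:
            a, c, fc = c, d, fd
            d = a + r * (b - a)
            fd = f(d)
        else:
            b, d, fd = d, c, fc
            c = b - r * (b - a)
            fc = f(c)
    return (a + b) / 2.0

# Simulated pseudo-observations on the survival-probability scale.
rng = random.Random(7)
alpha_true = 2.0
u, v, delta = [], [], []
for _ in range(3000):
    ui = 1.0 - rng.random()              # plays the role of S_Y(y_i)
    wi = 1.0 - rng.random()
    # conditional inversion: solve dC/du = wi for v
    vi = (ui ** -alpha_true * (wi ** (-alpha_true / (1.0 + alpha_true)) - 1.0)
          + 1.0) ** (-1.0 / alpha_true)
    ci = 0.5 * (1.0 - rng.random())      # S_Z at an independent censoring time
    u.append(ui)
    if vi >= ci:                         # z_i <= c_i: second event observed
        v.append(vi)
        delta.append(1)
    else:                                # censored: S_Z evaluated at c_i
        v.append(ci)
        delta.append(0)

alpha_hat = maximize(lambda a: log_lik_clayton(a, u, v, delta), 0.05, 10.0)
```

In an actual analysis the lists `u` and `v` would hold the bias-corrected marginal estimates evaluated at the observed times, and the extra variability from estimating the margins and θ would need to be accounted for as described in the text.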

Now we consider the general case when $\theta$ is unknown. It is natural to replace $\theta$ by the conditional maximum likelihood estimator $\hat\theta$ and derive an estimator of $\alpha$ by solving the equation $U_\alpha\{\hat S_Y(\hat\theta), \hat S_Z(\hat\theta), \alpha\}^{(p)} = 0$. Let $\hat\alpha(\hat\theta)$ denote its solution. The variability of $\hat\alpha(\hat\theta)$ can be decomposed into two terms as $\hat\alpha(\hat\theta) - \alpha_0 = \{\hat\alpha(\theta) - \alpha_0\} + \{\hat\alpha(\hat\theta) - \hat\alpha(\theta)\}$, where the variability in the first term is described by $\sigma^2$, and that in the second term is generated by the use of $\hat\theta$ to estimate $\theta$. The corresponding distributions of the two terms can be proven to be asymptotically orthogonal to each other because $\theta$ in the second term is estimated by a conditional likelihood. The proposed estimator $\hat\alpha(\hat\theta)$ has the following asymptotic properties.

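Theorem 1. As $n \to \infty$, $\hat\alpha(\hat\theta)$ is a consistent estimator of $\alpha_0$, and $n^{1/2}\{\hat\alpha(\hat\theta) - \alpha_0\}$ converges weakly to a normal distribution with mean zero and variance $\sigma_1^2 = \sigma^2 + \rho I_C^{-1}\rho^T$, where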

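$$\rho = E\left[\frac{\frac{\partial}{\partial\theta} U_\alpha\{S_Y(y,\theta), S_Z(z,\theta), \alpha\}}{\sum_{i=1}^n V_\alpha\{S_Y(y,\theta), S_Z(z,\theta), \alpha\}}\right], \qquad V_\alpha(u,v,\alpha) = \frac{\partial^2 \log l(u,v,\alpha)}{\partial\alpha^2},$$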

and $I_C$ is the Fisher information matrix of $L_c(\theta)$.

The details of the proof are given in supplementary materials, where we also show that $\sigma_1^2$ can be consistently estimated by $\hat\sigma_1^2 = \hat\sigma^2 + \hat\rho \hat I_C^{-1} \hat\rho^T$.

An estimator for the bivariate survival function can be obtained by plugging in consistent estimators for the unknown quantities in $S_{Y,Z}(y,z) = C\{S_Y(y), S_Z(z), \alpha\}$. To be specific, the margins $S_Y(y)$ and $S_Z(z)$ are replaced by the proposed semiparametric consistent estimators, and $\alpha$ is replaced by the two-stage association estimator $\hat\alpha(\hat\theta)$. The asymptotic properties of $\hat S_{Y,Z}(y,z)$ are summarized in Theorem 2, with the proof provided in supplementary materials.

Theorem 2. As $n \to \infty$, $\hat S_{Y,Z}(y,z)$ is a consistent estimator of $S_{Y,Z}(y,z)$, and the process $n^{1/2}\{\hat S_{Y,Z}(y,z) - S_{Y,Z}(y,z)\}$ converges weakly to a bivariate zero-mean Gaussian process with covariance function $[\partial C\{S_Y(y),S_Z(z),\alpha\}/\partial\alpha]^2\sigma_1^2 + \Sigma(y,z)$.

It is often interesting to report Kendall’s $\tau$ as a measure of association between bivariate survival data for convenient comparison across different copula models. With an estimator of the copula model association parameter $\alpha$, a “plug-in” estimator of Kendall’s $\tau$ can be obtained based on $\tau = 4\int_0^1\int_0^1 C(u,v,\alpha)\,dC(u,v,\alpha) - 1$. For certain copula models, there exists a further simplified one-to-one relationship between $\tau$ and $\alpha$. While the asymptotic variance of the estimator of $\tau$ may be obtained by the Delta method, the computation is rather complex and it is more convenient to use the bootstrap approach to compute the variance estimate.
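For reference, the one-to-one relationships between τ and α for the three families named in the Introduction are listed below under commonly used parameterizations. These parameterizations are an assumption on our part and may differ from the ones adopted in the paper, in which case the formulas change accordingly. Here $D_1$ denotes the first-order Debye function.

```latex
% Clayton, \alpha > 0
C(u,v,\alpha) = \left(u^{-\alpha} + v^{-\alpha} - 1\right)^{-1/\alpha},
\qquad \tau = \frac{\alpha}{\alpha + 2}.

% Positive stable, 0 < \alpha \le 1
C(u,v,\alpha) = \exp\left[-\left\{(-\log u)^{1/\alpha} + (-\log v)^{1/\alpha}\right\}^{\alpha}\right],
\qquad \tau = 1 - \alpha.

% Frank, \alpha \ne 0
C(u,v,\alpha) = -\frac{1}{\alpha}\log\left\{1 + \frac{(e^{-\alpha u} - 1)(e^{-\alpha v} - 1)}{e^{-\alpha} - 1}\right\},
\qquad \tau = 1 - \frac{4}{\alpha}\left\{1 - D_1(\alpha)\right\},
\qquad D_1(\alpha) = \frac{1}{\alpha}\int_0^{\alpha}\frac{t}{e^{t} - 1}\,dt.
```

For the Clayton and positive stable forms the plug-in estimate of τ is a closed-form transformation of the estimated α, whereas the Frank form requires a one-dimensional numerical integral.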

For the semi-stationary copula model, we rely on a parametric specification of $G(t,\theta)$ to take advantage of the available information about the distribution of $T$. It is expected to be more efficient than the model in which the distribution of $T$ is totally unknown and nonparametrically estimated. Of course, it is important to check the validity of the assumption $H_0: T \sim G(t;\theta)$. This can be done by plotting the nonparametric maximum likelihood estimate $\hat G_n(t)$ against $G(t,\hat\theta)$. Since $T$ is also doubly truncated subject to the constraint $-Y \le T \le t_0 - Y$, estimating $G$ is essentially the dual problem of estimating $S_Y$. Shen [17] provided an algorithm to jointly compute the nonparametric maximum likelihood estimators of both $G$ and $S_Y$. In the data analysis, the plot is used as a graphical tool to examine the adequacy of the fit of the parametric distribution of $T$.

4 The copula model with covariates

In this section, we propose to model the dependence of the copula association measure $\alpha$ on covariates by a transformation of $\gamma a$, $\alpha = \eta(\gamma a)$, where $\eta(\cdot)$ is a known function, $a$ is a $p \times 1$ vector of continuous or discrete covariates and $\gamma$ is the corresponding $1 \times p$ vector of regression coefficients. For example, $\eta(\cdot)$ corresponds to an exponential transformation, $\exp(\gamma a)$, for the Clayton copula. For a discrete covariate, we can either create a dummy variable for each level or assume a linear trend across levels. Though the choice of $\eta(\cdot)$ depends on the form and specific constraints of the copula model, the inference procedure developed in this section is general for any type of copula. The marginal distributions are modeled by proportional hazards models incorporating covariates in combination with the copula, under the semi-stationary condition. To be specific, the bivariate survival distribution of $(Y,Z)$ is modeled by the copula as $S_{Y,Z}(y,z) = C\{S_Y(y), S_Z(z), \alpha\}$, and marginally, we assume the failure times $Y$ and $Z$ satisfy the Cox proportional hazards models

$$h_Y(y|a) = h_{0Y}(y)\exp(\beta_Y a), \qquad h_Z(z|a,m) = h_{0Z}(z)\exp(\phi a + \mu m),$$

where βY and ϕ are the corresponding 1×p vectors of regression coefficients, μ is the regression coefficient for the calendar time of the first event M, h0Y(y) and h0Z(z) are unspecified baseline hazard functions for Y and Z, respectively. Denote βZ=(ϕ,μ) and b=(a,M). Note that the Cox model of Z is specified given both a and M to handle dependent censoring on Z. The marginal survival functions SY(y) and SZ(z) will be estimated from these two proportional hazards models.
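The link between covariates and the association parameter described at the start of this section can be illustrated with a short sketch. It assumes the exponential link mentioned in the text for the Clayton copula, together with the Clayton parameterization in which Kendall's τ equals α/(α + 2); the coefficient values, the covariate coding and the function names are hypothetical.

```python
import math

def clayton_alpha(gamma, a):
    """alpha = exp(gamma . a); the exponential link keeps alpha positive."""
    return math.exp(sum(g * ai for g, ai in zip(gamma, a)))

def clayton_tau(alpha):
    """Kendall's tau for C(u,v) = (u^-alpha + v^-alpha - 1)^(-1/alpha)."""
    return alpha / (alpha + 2.0)

gamma = [0.0, 0.7]                     # hypothetical regression coefficients
taus = {}
for a in ([1.0, 0.0], [1.0, 1.0]):     # intercept plus a binary covariate
    alpha = clayton_alpha(gamma, a)
    taus[tuple(a)] = clayton_tau(alpha)
```

Reporting the covariate-specific values of τ rather than α makes the strength of association comparable across subgroups and across copula families.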

Similar to the discussion in Section 3 where there is no covariate, bivariate survival data $(Y,Z)$ are collected with a selection probability equal to $w(y,\theta) = G(t_0 - y,\theta) - G(-y,\theta)$ due to the interval sampling, so we adjust for the sampling bias by the inverse probability weighting method. First consider the case when $\theta$ is known. For the first failure time $Y$, define $R_{j1} = \{i: y_i \ge y_j\}$ as the risk set at $y_j$. Define a weighted partial score function as

$$U_{j1}(\theta,\beta_Y)=a_j-\frac{\sum_{i\in R_{j1}}w(y_i,\theta)^{-1}\exp(\beta_Y a_i)\,a_i}{\sum_{i\in R_{j1}}w(y_i,\theta)^{-1}\exp(\beta_Y a_i)}\tag{3}$$

Let $U_1(\theta,\beta_Y)=\sum_j w(y_j,\theta)^{-1}U_{j1}(\theta,\beta_Y)$; an estimating equation for $\beta_Y$ is then $U_1(\theta,\beta_Y)=0$, with its solution denoted by $\hat\beta_Y(\theta)$. The inverse of the selection probability $w(y,\theta)$ is used to remove the sampling bias, since the probability of observing $y$ is proportional to $w(y,\theta)$. For the second failure time $Z$, we observe $\{(x,\delta)\}$, where $x=\min(z,c)$ and $\delta=I(z\le c)$. A weighted estimating equation for $\beta_Z$, $U_2(\theta,\beta_Z)=0$, can be constructed in the same way as the one for the working Cox model in Section 3. Its solution is denoted by $\hat\beta_Z(\theta)$. Now suppose $\theta$ is unknown. We replace $\theta$ by the conditional maximum likelihood estimator $\hat\theta$ and derive estimators of $\beta_Y$ and $\beta_Z$ by solving the corresponding equations $U_1(\hat\theta,\beta_Y)=0$ and $U_2(\hat\theta,\beta_Z)=0$, respectively. The solutions are denoted by $\hat\beta_Y(\hat\theta)$ and $\hat\beta_Z(\hat\theta)$, which have desirable asymptotic properties. Then, from the proportional hazards models, the marginal survival functions $S_Y(y)$ and $S_Z(z)$ can be consistently estimated by $\hat S_Y(y,\hat\theta\mid a)=\hat S_{0Y}(y)^{\exp\{\hat\beta_Y(\hat\theta)a\}}$ and $\hat S_Z(z,\hat\theta\mid b)=\hat S_{0Z}(z)^{\exp\{\hat\beta_Z(\hat\theta)b\}}$, where $\hat S_{0Y}(y)$ and $\hat S_{0Z}(z)$ are the Breslow estimators of the baseline survival functions. The details of the estimation procedures for $\hat\beta_Z(\theta)$, $\hat\beta_Y(\hat\theta)$, $\hat\beta_Z(\hat\theta)$, $\hat S_Y$ and $\hat S_Z$, together with the asymptotic properties and proofs, are provided in the supplementary materials.
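The weighted partial score for the first failure time can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the covariate is scalar, the selection probabilities are passed in as a precomputed array `w`, and the function names `weighted_partial_score` and `solve_beta` are hypothetical, not taken from the paper or any library.

```python
import numpy as np

def weighted_partial_score(beta, y, a, w):
    """U_1(theta, beta) = sum_j w_j^{-1} * (a_j - weighted risk-set mean of a).

    y: observed first failure times; a: scalar covariate values;
    w: selection probabilities w(y_i, theta), assumed already computed.
    """
    inv_w = 1.0 / w
    risk_weight = inv_w * np.exp(beta * a)
    score = 0.0
    for j in range(len(y)):
        at_risk = y >= y[j]                      # risk set R_{j1}
        a_bar = np.sum(risk_weight[at_risk] * a[at_risk]) / np.sum(risk_weight[at_risk])
        score += inv_w[j] * (a[j] - a_bar)
    return score

def solve_beta(y, a, w, lo=-5.0, hi=5.0, tol=1e-8):
    """Bisection for the root of the score, assuming it lies in [lo, hi].

    The score is non-increasing in beta, which is what makes bisection valid.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if weighted_partial_score(mid, y, a, w) > 0.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

With constant weights the function reduces to a rescaled ordinary Cox partial score, which gives a quick sanity check on any implementation.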

To estimate γ, using steps similar to those in Section 3, a pseudo-conditional likelihood score equation is constructed by replacing SY(y) and SZ(z) with SˆY(y,θˆ|a) and SˆZ(z,θˆ|a,m),

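$$U_\gamma\{\hat S_Y(\hat\theta),\hat S_Z(\hat\theta),\gamma\}_{(p)}=\frac{\partial}{\partial\gamma}\sum_{i=1}^{n}\Big(\delta_i\log\big[c\{\hat S_Y(y_i,\hat\theta\mid a_i),\hat S_Z(x_i,\hat\theta\mid a_i,m_i),\eta(\gamma a_i)\}\big]+(1-\delta_i)\log\Big[\frac{\partial C\{\hat S_Y(y_i,\hat\theta\mid a_i),\hat S_Z(x_i,\hat\theta\mid a_i,m_i),\eta(\gamma a_i)\}}{\partial u_i}\Big]\Big)=0$$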

Denote its solution by $\hat\gamma(\hat\theta)$, whose asymptotic properties can be developed by steps similar to those in Section 3. As $n\to\infty$, $\hat\gamma(\hat\theta)$ is a consistent estimator of $\gamma_0$, and $n^{1/2}\{\hat\gamma(\hat\theta)-\gamma_0\}$ converges weakly to a mean-zero multivariate normal distribution with variance-covariance matrix $\Sigma_\gamma$. Because the asymptotic variance-covariance matrix involves second-order derivatives, estimating $\Sigma_\gamma$ directly is not straightforward; we therefore estimate it by a bootstrap procedure. The variance of the association measure $\alpha$ can be obtained by the delta method or, alternatively, estimated by the sample variance over the bootstrap samples. The bivariate survival function can also be estimated from the copula model by substituting $S_Y(y)$, $S_Z(z)$ and $\gamma$ with their corresponding estimators derived in this section.
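As a worked instance of the delta-method step, take the exponential link $\alpha=\exp(\gamma a)$ mentioned earlier for the Clayton copula. The expression below is a sketch that follows from the asymptotic normality just stated; $\hat\Sigma_\gamma$ is notation introduced here for whatever estimate of $\Sigma_\gamma$ is available (for example, from the bootstrap).

$$\frac{\partial\alpha}{\partial\gamma}=\exp(\gamma a)\,a^{T},\qquad \widehat{\operatorname{Var}}(\hat\alpha)\approx\frac{1}{n}\exp\{2\hat\gamma(\hat\theta)a\}\,a^{T}\hat\Sigma_\gamma\,a .$$

For other links the same recipe applies with $\exp(\gamma a)$ replaced by the derivative $\eta'(\gamma a)$.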

5 Simulation studies

The first set of simulations is carried out to assess the performance of the proposed estimation and inference procedures for the copula model without covariates under a moderate sample size. Specifically, we examine finite sample properties of the proposed estimators of the marginal survival functions, the association parameter and the joint survival function. A set of data $\{(t_1,y_1,z_1),\ldots,(t_n,y_n,z_n)\}$ is generated as follows. Define $T=-3W+10$, where $W\sim\exp(\theta)$ with $\theta=1.0$ and $4.0$, corresponding to a decreasing density function similar to the setting in the real data application, and moderate-to-heavy censoring. Let the bivariate failure times $(Y,Z)$ be generated from three Archimedean copula models: the Clayton, positive stable and Frank copulas. We choose unit exponential margins and three different values of $\alpha$ in each of the copula models, in order to accommodate different levels of dependence between $Y$ and $Z$. An observation $(t,y,z)$ is included in the interval sampled dataset if and only if $0\le t+y\le 10$, and is censored if $t+y+z\ge 10$ or $z\ge c_1$, where $c_1$ is generated randomly from Uniform(0,4). When $\theta=1.0$, the truncation rate is about 25–30% and the censoring rate is about 30–35%. When $\theta=4.0$, the truncation rate is about 55–60% and the censoring rate is about 45–50%. For each choice of parameters $(\theta,\alpha)$, 1,000 simulated samples are generated with sample size $n=400$.
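The data-generating steps described above can be sketched as follows for the Clayton case. Several choices here are assumptions rather than details confirmed by the text: the birth time is read as $T=-3W+10$ with $W$ exponential with rate $\theta$ (a reading that reproduces the reported truncation rates), the Clayton copula is taken in the form $C(u,v)=(u^{-\alpha}+v^{-\alpha}-1)^{-1/\alpha}$ and applied to the survival functions, and the function name `simulate_interval_sample` is hypothetical.

```python
import numpy as np

def simulate_interval_sample(n_pop, theta, alpha, t0=10.0, seed=0):
    """Generate a population of size n_pop and keep the interval-sampled cases."""
    rng = np.random.default_rng(seed)
    # Birth time T = -3W + t0 with W ~ exponential(rate theta) (assumed reading).
    t = t0 - 3.0 * rng.exponential(scale=1.0 / theta, size=n_pop)
    # Clayton copula by conditional inversion; u, p lie in (0, 1].
    u = 1.0 - rng.uniform(size=n_pop)
    p = 1.0 - rng.uniform(size=n_pop)
    v = (u ** (-alpha) * (p ** (-alpha / (1.0 + alpha)) - 1.0) + 1.0) ** (-1.0 / alpha)
    # Unit exponential margins: S_Y(Y) = u and S_Z(Z) = v.
    y, z = -np.log(u), -np.log(v)
    # Interval sampling: the first event must fall inside [0, t0].
    keep = (t + y >= 0.0) & (t + y <= t0)
    t, y, z = t[keep], y[keep], z[keep]
    # Censoring by end of study (t + y + z >= t0) or by c1 ~ Uniform(0, 4).
    c1 = rng.uniform(0.0, 4.0, size=t.size)
    c = np.minimum(t0 - t - y, c1)
    x = np.minimum(z, c)
    delta = (z <= c).astype(int)
    return {"t": t, "y": y, "x": x, "delta": delta,
            "truncation_rate": 1.0 - keep.mean()}

sample = simulate_interval_sample(n_pop=200_000, theta=1.0, alpha=3.0)
print(sample["truncation_rate"], 1.0 - sample["delta"].mean())
```

To obtain a sample of fixed size $n=400$, one would draw from the retained cases or keep generating until 400 cases are included; that step is omitted here.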

The proposed estimators for the marginal survival functions of Y and Z are evaluated in Figure 2, where bivariate failure times (Y,Z) are generated from a Clayton copula model with α=3. The plots are based on the means of 1,000 simulations, and demonstrate that the weighted empirical survival function outperforms the empirical one in estimating SY(y), and the weighted estimator from a working Cox model outperforms the Kaplan–Meier estimator in estimating SZ(z). The copula association parameter estimate is obtained as αˆ(θˆ) by solving eq. (2). Particularly, the bivariate survival function estimator SˆY,Z(y,z) is assessed at (y,z)=(0.22, 0.51), denoted by Sˆ1. We also estimate Kendall’s τ as another measure of association. Table 1 provides the simulation results about θˆ, αˆ(θˆ), Sˆ1 and τˆ for the copula model without covariates, including the empirical bias, average model-based standard error, empirical standard error and 95% coverage probability. The confidence interval is constructed using the estimated asymptotic variance, and the empirical estimate of the 95% coverage probability is obtained based on the confidence interval over 1,000 replications. It shows the proposed estimators θˆ, αˆ(θˆ), Sˆ1 and τˆ work well with fairly small biases. For the association estimator αˆ(θˆ), the average model-based standard error is very close to the empirical standard error, which implies satisfactory performance of the inferential result of it. The coverage probabilities are all quite close to 95%. Moreover, the estimated standard error of αˆ(θˆ) increases in general with stronger dependence of bivariate data (Y,Z) indicated by a larger absolute value of α. This phenomenon is not very surprising since greater variations are usually expected for larger values.
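The gain from weighting can be illustrated with a small sketch for the first failure time alone. It assumes that the weighted empirical survival function takes the usual inverse-probability-weighted form, that $\theta$ is known, and that $T=-3W+10$ with $W$ exponential with rate $\theta$, so that $G(t)=\exp\{-\theta(10-t)/3\}$ for $t\le 10$. The function names are hypothetical.

```python
import numpy as np

def selection_prob(y, theta, t0=10.0):
    """w(y, theta) = G(t0 - y) - G(-y) for T = t0 - 3W, W ~ exponential(rate theta)."""
    def G(t):
        return np.exp(-theta * np.clip(t0 - t, 0.0, None) / 3.0)
    return G(t0 - y) - G(-y)

def weighted_survival(y_obs, w, y):
    """Inverse-probability-weighted empirical survival function at y."""
    inv_w = 1.0 / w
    return float(np.sum(inv_w * (y_obs > y)) / np.sum(inv_w))

# Simulated check with unit exponential Y, so the true S_Y(1) is exp(-1).
rng = np.random.default_rng(1)
theta, t0 = 1.0, 10.0
t = t0 - 3.0 * rng.exponential(scale=1.0 / theta, size=100_000)
y = rng.exponential(size=100_000)
keep = (t + y >= 0.0) & (t + y <= t0)
y_obs = y[keep]

s_weighted = weighted_survival(y_obs, selection_prob(y_obs, theta, t0), 1.0)
s_naive = float(np.mean(y_obs > 1.0))
print(s_weighted, s_naive, np.exp(-1.0))
```

In this setting the unweighted estimate is biased downward because long first failure times are less likely to fall inside the sampling window; the direction of the bias depends on the birth distribution.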

Figure 2 Simulation results of estimations of marginal survival functions of Y and Z: true survival functions (solid), the proposed weighted estimates (dash), and the estimates by conventional methods (dot). WEMP: weighted empirical survival function; EMP: empirical survival function; WCOX: weighted estimator by a working Cox model; KM: the Kaplan–Meier estimator.

Table 1

Simulation summary statistics of (θˆ,αˆ,Sˆ1,τˆ) for the copula model without covariates

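| Copula | $\theta$ | $b(\hat\theta)$ | $\hat\sigma(\hat\theta)$ | $\alpha$ | $b(\hat\alpha)$ | $\hat\sigma_e(\hat\alpha)$ | $\hat\sigma(\hat\alpha)$ | $cp(\hat\alpha)$ | $b(\hat S_1)$ | $\hat\sigma_e(\hat S_1)$ | $cp(\hat S_1)$ | $b(\hat\tau)$ | $\hat\sigma_e(\hat\tau)$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clayton | 1.0 | 0.3 | 8.4 | 0.50 | 1.1 | 14.8 | 13.3 | 95.3 | −0.3 | 3.2 | 94.6 | 0.3 | 4.9 |
| | | 0.1 | 7.2 | 1.33 | 2.6 | 24.1 | 18.2 | 95.5 | −0.2 | 3.0 | 95.7 | 1.0 | 4.7 |
| | | 0.1 | 7.6 | 3.00 | 8.1 | 44.8 | 42.2 | 95.7 | −0.2 | 3.1 | 93.5 | 1.3 | 3.6 |
| | 4.0 | 0.4 | 16.2 | 0.50 | 2.7 | 38.3 | 36.8 | 95.3 | −0.2 | 4.2 | 95.4 | 2.7 | 11.3 |
| | | 1.0 | 17.5 | 1.33 | 13.4 | 65.9 | 64.2 | 95.9 | −0.2 | 4.3 | 95.6 | 5.0 | 10.8 |
| | | 1.7 | 16.4 | 3.00 | 19.5 | 97.1 | 94.7 | 96.4 | −0.6 | 4.4 | 94.2 | 5.4 | 8.3 |
| Positive stable | 1.0 | 0.1 | 7.2 | 1.25 | 1.5 | 6.2 | 5.0 | 93.1 | 0.2 | 3.2 | 95.1 | 0.7 | 3.9 |
| | | 0.6 | 7.9 | 1.67 | 1.3 | 10.1 | 8.3 | 93.5 | −0.2 | 3.1 | 94.6 | 0.6 | 3.4 |
| | | 0.2 | 7.6 | 2.50 | 0.6 | 16.3 | 14.2 | 94.3 | −0.3 | 3.2 | 94.8 | 0.3 | 2.6 |
| | 4.0 | 1.3 | 15.8 | 1.25 | 1.8 | 8.5 | 6.7 | 94.3 | 0.2 | 4.6 | 95.1 | 0.8 | 5.6 |
| | | 1.8 | 17.3 | 1.67 | 0.7 | 15.5 | 14.2 | 94.4 | −0.1 | 4.4 | 95.7 | 0.4 | 5.4 |
| | | 1.4 | 16.5 | 2.50 | 1.3 | 26.2 | 23.7 | 95.6 | −0.3 | 4.5 | 96.1 | 0.7 | 4.4 |
| Frank | 1.0 | 0.7 | 7.9 | 2.00 | 1.0 | 41.8 | 40.6 | 95.4 | 0.1 | 3.1 | 94.5 | 0.1 | 4.1 |
| | | 0.2 | 7.5 | −1.00 | 3.2 | 42.2 | 40.5 | 94.8 | 0.1 | 3.2 | 95.2 | 0.2 | 4.6 |
| | | 0.1 | 7.8 | −2.00 | 0.4 | 42.1 | 40.6 | 95.8 | −0.1 | 3.1 | 94.2 | 0.2 | 4.3 |
| | 4.0 | 1.0 | 17.5 | 2.00 | 3.1 | 82.7 | 81.0 | 96.6 | −0.3 | 4.7 | 96.3 | 0.4 | 8.7 |
| | | 1.7 | 21.7 | −1.00 | 1.9 | 85.2 | 83.9 | 96.3 | −0.2 | 4.7 | 95.7 | 1.3 | 9.7 |
| | | 1.4 | 22.0 | −2.00 | 6.5 | 92.7 | 91.1 | 96.5 | −0.2 | 4.5 | 96.1 | 1.4 | 8.6 |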

The second set of simulations is conducted to examine the finite sample performance of the parameter estimators in the copula model with covariates. The data-generating procedure generally follows that in the first set of simulations, with three differences. First, the association measure $\alpha$ of each copula is allowed to depend on a covariate $A\sim\text{Bernoulli}(1/3)$. For the Clayton and positive stable copulas, $\alpha$ is modeled on the log scale as $\log(\alpha)=\gamma_0+\gamma_1 a$. For the Frank copula, $\alpha$ is modeled on the original scale as $\alpha=\gamma_0+\gamma_1 a$. We set the slope $\gamma_1=1/8$ and choose different values of the intercept $\gamma_0$ to achieve different levels of association between $Y$ and $Z$. Second, for the marginal failure times $Y$ and $Z$, we use the Weibull densities to construct the proportional hazards models

$$h_Y(y\mid a)=h_{0Y}(y)\exp(\beta_Y a),\qquad h_Z(z\mid a)=h_{0Z}(z)\exp(\phi a),$$

where $h_{0Y}(y)=2y$, $h_{0Z}(z)=2z$, and $\beta_Y=\phi=0.2$. Finally, for the distribution of $T$, we set $\theta=2.0$. An observation $(a,t,y,z)$ is included in the interval sampled dataset if and only if $0\le t+y\le 10$, and is censored if $t+y+z\ge 10$ or $z\ge c_1$, where $c_1$ is generated randomly from Uniform(0,4). The truncation rate is around 40% and the censoring rate is around 50%. For each setting, 1,000 simulated samples are generated with sample size $n=400$. We report the association estimate in terms of Kendall’s $\tau$ instead of the copula association parameter $\alpha$ for convenient comparison across different copula families, and study the bias and variance of $\hat\tau$ in different scenarios. For each dataset, the value of $\hat\tau$ is computed from the estimated parameters. The variance is estimated by the bootstrap approach with 500 bootstrap samples. Table 2 summarizes the simulation results for $\hat\theta$, $\hat\beta_Y(\hat\theta)$, $\hat\phi(\hat\theta)$ and $\hat\tau$ for the copula model with covariates. Under all the simulation scenarios, $\hat\beta_Y(\hat\theta)$ and $\hat\phi(\hat\theta)$ obtained from the weighted estimating equations perform well, with small biases and variances. For the association estimate $\hat\tau$, the bias is generally small, the empirical standard error and the average bootstrap standard error are of similar magnitude, and the coverage probabilities are close to 95%. This indicates that the proposed inference procedure for the copula model with covariates performs reasonably well and that the bootstrap variance estimator provides an appropriate measure of the variability of $\hat\tau$.
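The conversion from the copula parameter to Kendall’s $\tau$ can be sketched as below. The formulas assume the standard parametrizations of the three families (Clayton $\tau=\alpha/(\alpha+2)$, positive stable $\tau=1-1/\alpha$, Frank $\tau=1-4\{1-D_1(\alpha)\}/\alpha$ with $D_1$ the first Debye function); the paper's own parametrizations are given outside this section, so this correspondence is an assumption, although it matches the $\tau$ values listed in Table 2 closely. The function names are hypothetical.

```python
import numpy as np

def tau_clayton(alpha):
    """Kendall's tau for the Clayton copula, alpha > 0."""
    return alpha / (alpha + 2.0)

def tau_positive_stable(alpha):
    """Kendall's tau for the positive stable copula, alpha >= 1."""
    return 1.0 - 1.0 / alpha

def tau_frank(alpha, n_grid=20_000):
    """Kendall's tau for the Frank copula, any real alpha."""
    if alpha == 0.0:
        return 0.0
    # Midpoint rule for D_1(alpha) = (1/alpha) * integral_0^alpha t / (e^t - 1) dt.
    # The midpoints never hit t = 0, and the rule also works for alpha < 0.
    t = (np.arange(n_grid) + 0.5) * (alpha / n_grid)
    debye = float(np.mean(t / np.expm1(t)))
    return 1.0 - 4.0 / alpha * (1.0 - debye)

# Covariate-dependent association as in the second simulation set (gamma_1 = 1/8).
for a in (0, 1):
    print(a,
          tau_clayton(np.exp(np.log(0.5) + a / 8.0)),
          tau_positive_stable(np.exp(np.log(1.25) + a / 8.0)),
          tau_frank(2.0 + a / 8.0))
```

Reporting $\tau$ rather than $\alpha$ puts the three families on a common scale, which is the reason given above for using it in the comparison.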

Table 2

Simulation summary statistics of (θˆ,βˆY,ϕˆ,τˆ) for the copula model with covariates, where θ=2.0, γ1=1/8, βY=ϕ=0.2

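| Copula | $b(\hat\theta)$ | $\hat\sigma(\hat\theta)$ | $b_Y$ | $\hat\sigma_Y$ | $b_Z$ | $\hat\sigma_Z$ | $A$ | $\tau$ | $b(\hat\tau)$ | $\hat\sigma_e(\hat\tau)$ | $\hat\sigma_b(\hat\tau)$ | $cp(\hat\tau)$ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Clayton | 0.8 | 10.8 | 1.7 | 8.8 | 0.8 | 11.7 | 0 | 0.20 | −0.1 | 3.5 | 3.6 | 95.6 |
| | | | | | | | 1 | 0.22 | −0.4 | 3.5 | 3.8 | 95.6 |
| | 0.9 | 10.2 | 0.4 | 8.8 | 1.1 | 11.8 | 0 | 0.40 | −0.6 | 3.4 | 3.9 | 95.5 |
| | | | | | | | 1 | 0.43 | −0.8 | 3.3 | 3.6 | 95.7 |
| | 0.4 | 10.6 | 1.2 | 8.7 | 1.6 | 11.5 | 0 | 0.60 | −1.0 | 2.7 | 2.9 | 96.1 |
| | | | | | | | 1 | 0.63 | −0.9 | 2.5 | 2.8 | 96.2 |
| Positive stable | −0.4 | 10.9 | 0.6 | 8.8 | 0.4 | 11.5 | 0 | 0.20 | 0.4 | 3.9 | 4.1 | 93.6 |
| | | | | | | | 1 | 0.30 | 0.1 | 3.7 | 4.0 | 93.5 |
| | 0.6 | 11.1 | 0.1 | 8.8 | 0.1 | 11.7 | 0 | 0.40 | 0.5 | 3.6 | 3.7 | 94.3 |
| | | | | | | | 1 | 0.47 | 0.2 | 3.3 | 3.2 | 94.5 |
| | −0.3 | 10.9 | −0.4 | 8.9 | −1.1 | 11.8 | 0 | 0.60 | −0.2 | 3.0 | 3.2 | 94.7 |
| | | | | | | | 1 | 0.65 | −0.3 | 2.7 | 2.8 | 95.1 |
| Frank | 0.8 | 10.5 | 0.2 | 8.8 | 0.5 | 12.0 | 0 | 0.20 | 0.9 | 4.4 | 4.7 | 96.0 |
| | | | | | | | 1 | 0.23 | 0.2 | 4.0 | 4.4 | 96.2 |
| | 0.5 | 10.7 | −0.9 | 8.7 | 0.9 | 11.8 | 0 | −0.10 | 0.3 | 6.5 | 6.8 | 95.5 |
| | | | | | | | 1 | −0.09 | −0.2 | 4.1 | 4.5 | 95.8 |
| | 0.3 | 10.3 | 1.0 | 8.9 | −0.6 | 12.2 | 0 | −0.20 | −0.5 | 4.2 | 4.5 | 95.7 |
| | | | | | | | 1 | −0.21 | 1.2 | 4.2 | 4.4 | 95.8 |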

6 Application to the Rakai HIV study

6.1 Overall association

The HIV seroconversion data from the Rakai HIV study provide an example of bivariate survival data with interval sampling. In this study, 837 subjects were ascertained with a documented date of HIV seroconversion between 1995 and 2003, and followed until they died or by the end of 2003. Among them, 120 died and others were censored by out-migration or the end of the follow-up. The information on date of birth, date of death, sex, place of residence and HIV subtype is available. The bivariate survival times of interest are age at HIV infection and residual lifetime. Exclusion of subjects who were infected before 1995 or after 2003 results in selection bias of the interval sampling. For the purposes of illustration, we apply the proposed semi-stationary copula model methods to analyze the Rakai HIV seroconversion data, address statistical issues of the interval sampling and study the association between age at HIV infection and residual lifetime among HIV seroconverters. The data and analysis method allow one to model the HIV epidemic for treatment-naive individuals, which would help provide guidance on the initiation of ART.

In the analysis we assume the semi-stationary condition holds, that is, the progression of HIV is independent of the birth time of the study cohort. Denote the birth time of HIV seroconverters by $T$ with distribution function $G(t)$, age at infection by $Y$, residual lifetime after infection by $Z$ and calendar time of HIV infection by $M=T+Y$. Given that HIV seroconversion was identified between 1995 and 2003, and the cohort subjects’ ages range from 15 to 50 years, we let the support of $G(t)$ start from –50, corresponding to birth years from 1945, to avoid a non-identifiability problem in the estimation of $G(t)$. Recall that we assume the parametric form of the distribution of $T$ is known. Two polynomial functions are used to model the density of birth time $T$: a linear model $g(t)=k+\theta_1 t$, and a quadratic model $g(t)=k+\theta_1 t+\theta_2 t^2$, where $k$ is a given positive constant in both models and $t\ge -50$. The choice of the parametric form of the distribution of $T$ is examined by comparing the parametric estimate of the distribution function $G(t)$ with its nonparametric maximum likelihood estimate [17]. Figure 3(a) plots the parametric estimators, the nonparametric maximum likelihood estimator and the empirical estimator of the density function of $T$. It shows substantial discrepancies between the empirical estimator and the others, while the two polynomial models and the nonparametric method reveal a similar decreasing trend in the birth rate of HIV seroconverters. As discussed in Section 3, $T$ is doubly truncated with the sampling constraint $-Y\le T\le t_0-Y$; therefore, the sampling density of $T$ is generally biased and it is not appropriate to use the empirical method to estimate the birth rate. Further, the decreasing trend may partly reflect changes in the population under surveillance, such as trends in HIV incidence and prevalence. 
Indeed, the HIV incidence in Rakai declined from approximately 2.0 per 100 person-years in 1995 to 1.3 per 100 person-years in 2003, and the HIV prevalence declined from approximately 18% in 1995 to 13% in 2003 [4]. Given the small difference between the two polynomial model fits, and the closer agreement between the nonparametric and quadratic model estimates, we choose the quadratic model for the birth density. The parameter estimates, with estimated standard errors in parentheses, are $\hat\theta_1=-2.00\times10^{-3}$ $(4.35\times10^{-4})$ and $\hat\theta_2=8.30\times10^{-5}$ $(2.17\times10^{-6})$. In addition, the parametric assumption on the distribution of $T$, $H_0: T\sim G(t,\theta)$, is checked in Figure 3(b) by plotting the nonparametric maximum likelihood estimator $\hat G_n(t)$ against $G(t,\hat\theta)$, which suggests that the quadratic birth density is a reasonable assumption.

Figure 3 (a) Birth density plots: linear model estimate (dash), quadratic model estimate (dot), nonparametric estimate (dash-dot), and the biased empirical estimate (solid). (b) Scatter plot of $\hat G_n(t)$ against $G(t,\hat\theta)$. The dashed diagonal line $y=x$ is shown as reference. Non-G, nonparametric maximum likelihood estimator $\hat G_n(t)$; Para-G, parametric estimator $G(t,\hat\theta)$

Next, the marginal survival functions $S_Y(y)$ and $S_Z(z)$ for age at HIV infection and residual lifetime are estimated by the weighted empirical survival function and the weighted estimator from a working Cox model, adjusting for the selection bias from interval sampling. Figure 4(a) shows that the empirical method overestimates the marginal survival of age at infection compared with the weighted one, and Figure 4(b) shows that the Kaplan–Meier estimator overestimates the marginal survival of residual lifetime compared with the weighted estimator from the working Cox model.

Figure 4 (a) Estimated marginal survival functions of age at infection. EMP: empirical survival function; WEMP: weighted empirical survival function. (b) Estimated marginal survival functions of residual lifetime. KM: the Kaplan–Meier estimator; WCOX: weighted estimator by a working Cox model. (c) Estimated marginal survival functions of residual lifetime for different categories of age at infection. (d) Estimated conditional survival functions of residual lifetime given different categories of age at infection

Previous analysis [4] showed that survival time decreased significantly with older age at infection, based on a Cox proportional hazards model of residual lifetime conditional on age at infection. However, the appropriateness of that Cox model is under investigation, since it did not take into account the fact that the data are collected under interval sampling. As discussed, due to the interval sampling, age at infection is doubly truncated and residual lifetime is observed subject to dependent right censoring. Therefore, selection bias needs to be corrected when analyzing the data and studying the association between age at infection and residual lifetime. We consider the proposed copula model without covariates in Section 3, where the dependence structure is fitted by the Frank copula, and assess the overall association. To estimate the standard error of the association estimator, we adopt a nonparametric bootstrap method, sampling 837 subjects with replacement from the dataset. The resampling procedure is repeated 500 times. The confidence interval is constructed based on asymptotic normality, with the standard error computed from the bootstrap resamples. The association parameter $\alpha$ is estimated as –0.19 with 95% confidence interval (–1.21, 0.85). The corresponding Kendall’s $\tau$ is estimated as –0.02 with confidence interval (–0.13, 0.09). In contrast to the result of the previous study, our analysis suggests a non-significant negative association between age at infection and residual lifetime among HIV seroconverters after adjusting for the sampling bias. Further, the relationship between the bivariate survival times is explored graphically in Figure 4(c), by plotting estimated marginal survival functions of residual lifetime for two categories of age at infection, <30 years and ≥30 years, and in Figure 4(d), by plotting estimated conditional survival functions of residual lifetime given these two categories. 
Figure 4(c) demonstrates a slightly negative association and a trend towards lower survival probability with older age at infection. However, Figure 4(d) shows that the estimated survival probability conditional on age at infection ≥30 years is comparable to that conditional on age at infection <30 years, which may explain the weak and non-significant negative association indicated by the estimate of $\alpha$.
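The resampling scheme used for the interval estimates (subjects drawn with replacement, 500 replicates, normal-approximation interval with a bootstrap standard error) can be sketched generically. The helper name `bootstrap_normal_ci` and the `estimator` callback are hypothetical; in the application the callback would refit the whole two-stage copula procedure on each resample, which is not reproduced here.

```python
import numpy as np

def bootstrap_normal_ci(data, estimator, n_boot=500, seed=0):
    """Point estimate, bootstrap SE and 95% normal-approximation interval.

    data: array with one row (or entry) per subject;
    estimator: function mapping such an array to a scalar estimate.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = data.shape[0]
    estimate = float(estimator(data))
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample subjects with replacement
        replicates[b] = estimator(data[idx])
    se = float(replicates.std(ddof=1))
    z = 1.959963984540054                      # 97.5% standard normal quantile
    return estimate, se, (estimate - z * se, estimate + z * se)

# Toy usage with the sample mean standing in for the association estimator.
toy = np.random.default_rng(42).normal(size=837)
est, se, ci = bootstrap_normal_ci(toy, np.mean)
print(est, se, ci)
```

Resampling whole subjects keeps each triple of birth time, age at infection and follow-up together, so the dependence between them is preserved in every replicate.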

6.2 Subtype-stratified association and subtype-adjusted association

Studies suggest that the progression of HIV infection is affected by the HIV subtype [5]. HIV subtypes differ in biological characteristics that may affect pathogenicity, such as viral fitness and plasma viral loads. These differences may theoretically influence virus infectivity and transmissibility. We investigate this issue by analyzing the Rakai HIV seroconversion data with information on HIV subtype. Among 837 HIV seroconverters, 413 individuals’ HIV subtypes could be identified because their blood serum samples had sufficient HIV RNA for reverse transcriptase polymerase chain reaction amplification. Subtypes were classified as A (15.4%), C (0.5%), D (58.3%), AD recombinants (20.2%) and multiple infections (5.6%). Earlier analysis of the Rakai data suggests that subtypes D, AD recombinants and multiple infections have similar disease progression rates and there is only one individual with subtype C infection in this data set, so for the analysis purposes we compare A subtype with combined non-A virus subtypes [4].

First, we stratify by HIV subtype to assess the association between age at infection and residual lifetime. The estimated marginal survival functions of age at infection and residual lifetime by HIV subtype are shown in Figure 5. The survival curves of age at infection for A subtype, non-A subtypes and unknown subtype are almost the same, but the survival probability of residual lifetime is substantially lower for non-A and unknown subtypes than for A subtype. In fact, there are only 2 deaths among 64 subtype A infections, compared with 45 deaths among 349 non-A subtype infections. This is consistent with the result from the previous study in Uganda [5] that A subtype has a slower disease progression rate and is thought to be less pathogenic than other subtypes. The association measure $\alpha$ in the Frank copula model is estimated as 2.96 with 95% confidence interval (–0.56, 6.48) for A subtype, –0.38 with (–1.64, 0.93) for non-A subtypes and –0.36 with (–1.16, 0.48) for unknown subtype. The corresponding Kendall’s $\tau$ is estimated as 0.30 with confidence interval (–0.06, 0.54) for A subtype, –0.04 with (–0.18, 0.10) for non-A subtypes and –0.04 with (–0.13, 0.05) for unknown subtype. Interestingly, the results show a comparable negative association for non-A and unknown subtypes and, conversely, a positive association for A subtype, though none of the associations is significant. The result suggests that the Rakai HIV epidemic probably has a predominance of non-A subtype infections and that subtype A progresses very differently from other subtypes, which is consistent with the conclusions of other studies.

Figure 5 Estimated marginal survival functions of age at infection and residual lifetime by HIV subtype: curves for A subtype (solid), curves for non-A subtype (dash), and curves for unknown subtype (dot)

Next, we examine the relationship between age at infection and residual lifetime adjusting for the HIV subtype by the copula model with covariates in Section 4. The dependence structure in the Frank copula is allowed to depend on the HIV subtype through α=γ0+γ1a, where a=(a1,a2), a1 denotes the indicator for A subtype and a2 denotes the indicator for non-A subtypes. An individual’s subtype is unknown if a=(0,0). Marginally, age at infection and residual lifetime given the HIV subtype are modeled through the Cox proportional hazards models: hY(y|a)=h0Y(y)exp(βYa) and hZ(z|a,m)=h0Z(z)exp(ϕa+μm) where m is the calendar time of HIV infection. From weighted estimating equations, the proposed estimate vector for βY is (0.05, –0.07) with P-values >0.05 and that for ϕ is (–0.41, 0.29) with P-values <0.05. The values of estimates for βY and ϕ (the estimates of the covariate effects on each failure time) suggest similar survival patterns among HIV subtype groups as shown in Figure 5, which indicates no significant difference in age at infection by the HIV subtype, significant lower risk of death comparing A subtype to unknown subtype, and significant higher risk of death comparing non-A subtypes to unknown subtype. After covariate adjustment, the copula association parameter α is estimated as 4.31 with the 95% confidence interval (1.98, 6.64) for A subtype, –0.95 with (–2.30, 0.52) for non-A subtypes and –0.44 with (–1.76, 0.98) for unknown subtype. Correspondingly, Kendall’s τ is estimated as 0.41 with the confidence interval (0.21, 0.55) for A subtype, –0.10 with (–0.24, 0.06) for non-A subtypes and –0.05 with (–0.19, 0.11) for unknown subtype. The confidence interval is computed based on 500 realizations of the resampling method. For each subtype, the adjusted association estimate suggests the same direction of the association as its unadjusted counterpart but a larger magnitude. 
The patterns of the association are similar between non-A subtypes and unknown subtype, both of which are non-significantly negative. It shows A subtype presents a significant positive association, which again illustrates that A subtype is very different from other HIV subtypes in the progression of disease.

7 Concluding remarks

This paper considers statistical issues for bivariate survival data with interval sampling, which arise commonly in disease registries or surveillance systems where data are collected conditional on the first failure event occurring within a specific time interval. In this paper, the semi-stationary condition is assumed for statistical modeling and inference. It is important to note that the semi-stationary assumption could be violated when, for instance, improved diagnostic strategies over time lead to earlier detection, or an effective treatment becomes available and is given to the diseased individuals during the observation period. The situation in which (S) does not hold is beyond the scope of this paper and will be explored in future research. Under the semi-stationary assumption, we investigated the association between bivariate survival data with interval sampling using copula models without and with adjustment for covariates. Since the asymptotic variance of the parameters in the copula model with covariates has a rather complicated form and possibly involves the censoring distribution, the bootstrap method is applied as a direct and robust way to compute standard errors.

The data structure considered in this paper assumes that the first failure time $Y$ is observed exactly. However, as pointed out by one reviewer, there are situations in some HIV cohorts where $Y$ is interval censored. The interval censoring problem for $Y$ is worthy of further investigation and may be handled by extending prior work on bivariate survival distribution estimation for interval-censored outcomes [18]. For convenience of discussion, the proportional hazards model is used to model the relationship between each failure time and covariates. In fact, any regression model for survival data, such as the semiparametric transformation model or the accelerated failure time model, can be used. The proposed copula model framework is very general and can be modified to accommodate other regression models for the marginal distributions. Moreover, the same covariates may not all be relevant as causal factors for the two failure events of interest or for the association between them. Nevertheless, in developing the method, we consider a general case by including the same set of covariates in modeling the marginal distributions and the association, which allows one to test the significance of the covariate effects and to estimate them. In real data applications, one can always start with a general model that includes all factors of interest as covariates and keep only the significant ones in the final model.

In the simulations and data application, certain specific copula models are used to characterize the dependence structure of bivariate survival data, given their popularity, modeling flexibility and computational convenience. The simulations show that, when the true copula is known, the estimation procedure provides good results, and we anticipate that the procedure will also perform well for other copulas. In fact, any copula model could be considered, which raises the closely related issues of how a wrong choice of copula model would affect the estimated association measure and how to choose an appropriate copula model. Since different copula models may lead to different dependence properties of the bivariate survival function, the problem of model selection among copulas needs to be addressed in future work. Potentially, goodness-of-fit procedures for the copulas could be developed for bivariate survival data with interval sampling. In the absence of covariates, we suggest comparing the copula model fit with nonparametric estimates of the association, such as the cross-ratio function or Kendall’s $\tau$. Nonparametric association estimation for such data may therefore shed some light on model selection, but it is still under development.

The research is motivated by and applied to the Rakai HIV seroconversion data to evaluate the association between age at HIV infection and residual lifetime among treatment-naive HIV seroconverters, and to study how the association varies by HIV subtype and changes after controlling for HIV subtype. Another scientifically interesting problem for further research is to examine the effect of ART on HIV progression. In the Rakai Health Sciences Program, ART became available in 2004, and this time-dependent treatment variable would further complicate the analysis.

Acknowledgments

This work was supported in part by the Cancer Center Support Grant from the National Cancer Institute awarded to the Harold C. Simmons Cancer Center at the University of Texas Southwestern Medical Center. The authors thank the editor, the associate editor and two reviewers for their constructive comments that have greatly improved the initial version of this paper. We also thank the Rakai Health Sciences Program at Johns Hopkins Bloomberg School of Public Health for providing the data.

Appendix A: Asymptotic properties of α^(θ^) in the copula model without covariates

Proof of Theorem 1

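Assume that the standard regularity conditions for the maximum likelihood estimator hold and that the functions $W_\alpha\{S_Y(y),S_Z(z),\alpha\}$, $V_\alpha\{S_Y(y),S_Z(z),\alpha\}$, $V_{\alpha,1}\{S_Y(y),S_Z(z),\alpha\}$ and $V_{\alpha,2}\{S_Y(y),S_Z(z),\alpha\}$ are continuous and bounded for $(y,z)\in A=[y_-,y_+]\times[z_-,z_+]$, where

$$W_\alpha(u,v,\alpha)=\frac{\partial\log l(u,v,\alpha)}{\partial\alpha},\qquad V_\alpha(u,v,\alpha)=\frac{\partial^2\log l(u,v,\alpha)}{\partial\alpha^2},$$
$$V_{\alpha,1}(u,v,\alpha)=\frac{\partial^2\log l(u,v,\alpha)}{\partial\alpha\,\partial u},\qquad V_{\alpha,2}(u,v,\alpha)=\frac{\partial^2\log l(u,v,\alpha)}{\partial\alpha\,\partial v}.$$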

In the following, we study the asymptotic properties of αˆ(θˆ).

If $\theta$ is known, as $n\to\infty$, $\hat\alpha(\theta)-\alpha_0$ converges to 0 in probability, and $n^{1/2}\{\hat\alpha(\theta)-\alpha_0\}$ converges weakly to a normal distribution with mean zero and variance $\sigma^2$. The large sample properties of $\hat\alpha(\theta)$ can be proved following the lines in Zhu and Wang [1], where the marginal survival functions $S_Y(y)$ and $S_Z(z)$ are replaced by the corresponding consistent estimators involving $\theta$. Observe that

$$n^{1/2}\{\hat\alpha(\hat\theta)-\alpha_0\}=n^{1/2}\{\hat\alpha(\theta)-\alpha_0\}+n^{1/2}\{\hat\alpha(\hat\theta)-\hat\alpha(\theta)\}\tag{4}$$

As previously discussed, if $\theta$ is known, the first term in eq. (4) converges weakly to a normal distribution with mean 0 and variance $\sigma^2$. By the counting process methodology [19], it is asymptotically equivalent to a sum of $n$ i.i.d. zero-mean random variables, expressed as

$$n^{1/2}\{\hat\alpha(\theta)-\alpha_0\}=n^{-1/2}\sum_{i=1}^{n}\phi_i(\alpha,\theta)+o_p(1)\tag{5}$$

where $\phi_i(\alpha,\theta)=U_\alpha\{S_Y(y,\theta),S_Z(z,\theta),\alpha\}\big[\sum_{i=1}^{n}V_\alpha\{S_Y(y,\theta),S_Z(z,\theta),\alpha\}\big]^{-1}$ and, for each $\theta$, $E\{\phi_i(\alpha,\theta)\}=0$.

To develop the asymptotic results for the second term in eq. (4), the additional variation created by estimating $\theta$ by $\hat\theta$, the conditional maximum likelihood estimator based on $L_c(\theta)$, needs to be handled. Let $\rho=E\{\partial\phi_i(\alpha,\theta)/\partial\theta\}$. Under appropriate regularity conditions, the second term can be approximated by a sum of $n$ i.i.d. zero-mean random variables, expressed as

$$n^{1/2}\{\hat\alpha(\hat\theta)-\hat\alpha(\theta)\}=n^{-1/2}\rho I_C^{-1}\sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i\mid Y_i)+o_p(1)\tag{6}$$

which converges weakly to a normal distribution with mean 0 and variance $\rho I_C^{-1}\rho^{T}$. Thus, $\hat\alpha(\hat\theta)-\hat\alpha(\theta)$ converges to 0 in probability. Therefore, $\hat\alpha(\hat\theta)-\alpha_0=\{\hat\alpha(\theta)-\alpha_0\}+\{\hat\alpha(\hat\theta)-\hat\alpha(\theta)\}$ converges to 0 in probability. This completes the proof of consistency of $\hat\alpha(\hat\theta)$.

Combining the preceding results of eqs (5) and (6), we get

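$$n^{1/2}\{\hat\alpha(\hat\theta)-\alpha_0\}\approx n^{-1/2}\sum_{i=1}^{n}\phi_i(\alpha,\theta)+n^{-1/2}\rho I_C^{-1}\sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i\mid Y_i)\tag{7}$$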

Also the corresponding distributions of those two terms are asymptotically orthogonal to each other, since

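$$E\Big\{\phi_i(\alpha,\theta)\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i\mid Y_i)\Big\}=E\Big[\phi_i(\alpha,\theta)\,E\Big\{\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i\mid Y_i)\,\Big|\,Y_i\Big\}\Big]=0\tag{8}$$

Equations (7) and (8) imply that $n^{1/2}\{\hat\alpha(\hat\theta)-\alpha_0\}$ is asymptotically equivalent to a sum of $n$ i.i.d. zero-mean random variables. By the central limit theorem, it converges weakly to a normal random variable with mean zero and variance $\sigma_1^2=\sigma^2+\rho I_C^{-1}\rho^{T}$. It is natural to estimate $\sigma_1^2$ by $\hat\sigma_1^2=\hat\sigma^2+\hat\rho\hat I_C^{-1}\hat\rho^{T}$, where $\hat\sigma^2$ is a consistent estimator of $\sigma^2$, and $\hat\rho$ and $\hat I_C$ are the corresponding moment-type empirical estimators. The consistency of $\hat\sigma^2$, $\hat\rho$ and $\hat I_C$ implies that $\hat\sigma_1^2$ is a consistent estimator of $\sigma_1^2$.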

Appendix B: Asymptotic properties of $\hat S_{Y,Z}(y,z)$ in the copula model without covariates

Proof of Theorem 2

An estimate of $S_{Y,Z}(y,z)$ at any given time point $(y,z)\in A=[y_-,y_+]\times[z_-,z_+]$ is obtained under the copula model as $\hat S_{Y,Z}(y,z)=C\{\hat S_Y(y),\hat S_Z(z),\hat\alpha(\hat\theta)\}$, where $\hat\alpha(\hat\theta)$ is the two-stage association parameter estimator, and $\hat S_Y(y)$ and $\hat S_Z(z)$ are the proposed semiparametric consistent estimators. The asymptotic results for $\hat S_{Y,Z}(y,z)$ are proved under the conditions stated in the proof of Theorem 1 and the following regularity conditions. Assume that the copula function $C(u,v,\alpha)$ is continuous and differentiable in each of $u$, $v$ and $\alpha$, and that the parameter $\alpha$ lies in a compact set.

First, we show the consistency of $\hat S_{Y,Z}(y,z)$. We have that $\hat S_Y(\cdot)$ converges in probability to $S_Y(\cdot)$ uniformly in $[y_-,y_+]$, and $\hat S_Z(\cdot)$ converges in probability to $S_Z(\cdot)$ uniformly in $[z_-,z_+]$. Also, by Theorem 1, $\hat\alpha(\hat\theta)$ converges in probability to $\alpha_0$. Since the copula function $C(u,v,\alpha)$ is a continuous function of $u$, $v$ and $\alpha$, $C\{\hat S_Y(y),\hat S_Z(z),\hat\alpha(\hat\theta)\}$ converges in probability to $C\{S_Y(y),S_Z(z),\alpha_0\}$ uniformly in $A=[y_-,y_+]\times[z_-,z_+]$. Therefore, as $n\to\infty$, $\hat S_{Y,Z}(y,z)$ converges to $S_{Y,Z}(y,z)$ in probability.
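The plug-in construction $\hat S_{Y,Z}(y,z)=C\{\hat S_Y(y),\hat S_Z(z),\hat\alpha(\hat\theta)\}$ can be sketched for one concrete copula family. The example below assumes a Clayton copula in the parameterization $C(u,v;\alpha)=(u^{-\alpha}+v^{-\alpha}-1)^{-1/\alpha}$ with $\alpha>0$, which may differ from the parameterization used in the main text; the marginal estimates and the value of $\hat\alpha$ are made-up numbers, and the function names are hypothetical.

```python
def clayton_copula(u, v, alpha):
    """Clayton copula C(u, v; alpha) = (u^-alpha + v^-alpha - 1)^(-1/alpha), alpha > 0."""
    if not (0.0 < u <= 1.0 and 0.0 < v <= 1.0):
        raise ValueError("u and v must lie in (0, 1]")
    if alpha <= 0.0:
        raise ValueError("alpha must be positive in this parameterization")
    return (u ** -alpha + v ** -alpha - 1.0) ** (-1.0 / alpha)


def joint_survival(s_y_hat, s_z_hat, alpha_hat):
    """Plug-in joint survival estimate at one time point (y, z)."""
    return clayton_copula(s_y_hat, s_z_hat, alpha_hat)


# Made-up marginal survival estimates and association estimate
print(joint_survival(0.8, 0.6, 2.0))
```

Because the copula is continuous in all three arguments, small errors in the two marginal estimates and in $\hat\alpha$ translate into a small error in the joint estimate, which is the content of the consistency argument.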

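Next, we derive the asymptotic distribution of $\hat S_{Y,Z}(y,z)$. Applying the functional delta method to $C\{\hat S_Y(y),\hat S_Z(z),\hat\alpha(\hat\theta)\}$ around $\alpha_0$, $S_Y$ and $S_Z$, we get

$$\begin{aligned}
n^{1/2}\left[C\{\hat S_Y(y),\hat S_Z(z),\hat\alpha(\hat\theta)\}-C\{S_Y(y),S_Z(z),\alpha_0\}\right]\approx{}&n^{1/2}\frac{\partial C\{S_Y(y),S_Z(z),\alpha\}}{\partial\alpha}\{\hat\alpha(\hat\theta)-\alpha_0\}\\
&+n^{1/2}\frac{\partial C\{S_Y(y),S_Z(z),\alpha_0\}}{\partial u}(\hat S_Y-S_Y)(y)\\
&+n^{1/2}\frac{\partial C\{S_Y(y),S_Z(z),\alpha_0\}}{\partial v}(\hat S_Z-S_Z)(z)
\end{aligned}\tag{9}$$

By Theorem 1, $n^{1/2}\{\hat\alpha(\hat\theta)-\alpha_0\}$ converges weakly to a normal distribution with mean zero and variance $\sigma_1^2$. Therefore, the first term in eq. (9) is asymptotically equivalent to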

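$$n^{-1/2}\frac{\partial C\{S_Y(y),S_Z(z),\alpha\}}{\partial\alpha}\,\sigma_1\sum_{i=1}^{n}\frac{\partial\log l\{\hat S_Y(y),\hat S_Z(z),\alpha\}}{\partial\alpha}\tag{10}$$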

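which is a sum of $n$ i.i.d. random variables. Applying the counting process asymptotic techniques to $\hat S_Y$ and $\hat S_Z$, the sum of the second and the third terms in eq. (9) is asymptotically equivalent to

$$n^{-1/2}\sum_{i=1}^{n}\left[\frac{\partial C\{S_Y(y),S_Z(z),\alpha_0\}}{\partial u}I_{10}(Y_i)(y)+\frac{\partial C\{S_Y(y),S_Z(z),\alpha_0\}}{\partial v}I_{20}(X_i,\delta_i)(z)\right]\tag{11}$$

where

$$I_{10}(Y_i)(y)=-S_Y(y)\left[\int_0^y\{dN_{1i}(u)\}\{p(Y\ge u)\}^{-1}-\int_0^y\{I(Y_i\ge u)\,d\Lambda_1(u)\}\{p(Y\ge u)\}^{-1}\right]$$

and

$$I_{20}(X_i,\delta_i)(z)=-S_Z(z)\left[\int_0^z\{dN_{2i}(u)\}\{p(Z\ge u,C_2\ge u)\}^{-1}-\int_0^z\{I(X_i\ge u)\,d\Lambda_2(u)\}\{p(Z\ge u,C_2\ge u)\}^{-1}\right]$$

with $N_{1i}(u)=I(Y_i\le u)$, $N_{2i}(u)=I(Z_i\le u,\delta_i=1)$ and $C_2=C-T-Y$. Note that eq. (11) is a sum of $n$ i.i.d. random variables, and the expectation of each term in eq. (11) is zero. By the central limit theorem, eq. (11) converges weakly to a normal distribution with zero mean and variance $\Sigma(y,z)$.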

Moreover, we have

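$$E\left[\frac{\partial\log l\{\hat S_Y(y),\hat S_Z(z),\alpha\}}{\partial\alpha}\{I_{10}(Y_i)+I_{20}(X_i,\delta_i)\}\right]=E\left(\{I_{10}(Y_i)+I_{20}(X_i,\delta_i)\}\,E\left[\frac{\partial\log l\{\hat S_Y(y),\hat S_Z(z),\alpha\}}{\partial\alpha}\,\middle|\,Y_i,X_i,\delta_i\right]\right)=0\tag{12}$$

which means that eqs (10) and (11) are asymptotically orthogonal. Therefore, eqs (10), (11) and (12) imply that, as $n\to\infty$, the process $n^{1/2}\{\hat S_{Y,Z}(y,z)-S_{Y,Z}(y,z)\}$ converges weakly to a bivariate zero-mean Gaussian process with covariance function $[\partial C\{S_Y(y),S_Z(z),\alpha\}/\partial\alpha]^2\sigma_1^2+\Sigma(y,z)$.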
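For a fixed point $(y,z)$, the limiting variance $[\partial C/\partial\alpha]^2\sigma_1^2+\Sigma(y,z)$ is straightforward to evaluate once a copula family is chosen. The sketch below again assumes a Clayton copula $C(u,v;\alpha)=(u^{-\alpha}+v^{-\alpha}-1)^{-1/\alpha}$, $\alpha>0$, as an illustrative family, and treats $\sigma_1^2$ and $\Sigma(y,z)$ as given made-up numbers; the function names are hypothetical.

```python
import math


def clayton_copula(u, v, alpha):
    """Clayton copula C(u, v; alpha) = (u^-alpha + v^-alpha - 1)^(-1/alpha), alpha > 0."""
    return (u ** -alpha + v ** -alpha - 1.0) ** (-1.0 / alpha)


def dC_dalpha(u, v, alpha):
    """Analytic derivative of the Clayton copula with respect to alpha."""
    a = u ** -alpha + v ** -alpha - 1.0
    # Derivative of a with respect to alpha
    da = -(u ** -alpha) * math.log(u) - (v ** -alpha) * math.log(v)
    c = a ** (-1.0 / alpha)
    # C = exp(-log(a) / alpha), so dC/dalpha = C * {log(a) / alpha^2 - da / (alpha * a)}
    return c * (math.log(a) / alpha ** 2 - da / (alpha * a))


def pointwise_variance(s_y, s_z, alpha, sigma1_sq, big_sigma_yz):
    """[dC/dalpha]^2 * sigma_1^2 + Sigma(y, z) at a single point (y, z)."""
    grad = dC_dalpha(s_y, s_z, alpha)
    return grad * grad * sigma1_sq + big_sigma_yz


# Made-up inputs, for illustration only
print(pointwise_variance(0.8, 0.6, 2.0, sigma1_sq=8.5, big_sigma_yz=0.01))
```

The two contributions add without a cross term because the association part and the marginal part are asymptotically orthogonal.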

Appendix C: Estimations and asymptotic properties in the copula model with covariates

We provide detailed estimation procedures for $\hat\beta_Z(\theta)$, $\hat\beta_Y(\hat\theta)$, $\hat\beta_Z(\hat\theta)$, $\hat S_Y$ and $\hat S_Z$ in the copula model with covariates, together with the related asymptotic properties and their proofs. Suppose first that $\theta$ is known. Define a weighted partial score function as

$$U_{j2}(\theta,\beta_Z)=b_j-\frac{\sum_{i\in R_{j2}}w(y_i,\theta)^{-1}\exp(\beta_Z b_i)\,b_i}{\sum_{i\in R_{j2}}w(y_i,\theta)^{-1}\exp(\beta_Z b_i)}$$

where $R_{j2}=\{i:x_i\ge x_{(j)}\}$ is the risk set at $x_{(j)}$, and $\{x_{(1)},\ldots,x_{(k)}\}$ are the distinct ordered uncensored second failure times. Then the estimating equation for $\beta_Z$ is $U_2(\theta,\beta_Z)=\sum_j w(y_j,\theta)^{-1}U_{j2}(\theta,\beta_Z)=0$, with the solution denoted by $\hat\beta_Z(\theta)$. The asymptotic properties of $\hat\beta_Y(\theta)$ and $\hat\beta_Z(\theta)$ can be established following the lines in Qi et al. [20]. As $n\to\infty$, the random vectors $n^{1/2}\{\hat\beta_Y(\theta)-\beta_Y\}$ and $n^{1/2}\{\hat\beta_Z(\theta)-\beta_Z\}$ converge weakly to multivariate normal distributions with mean zero and variance-covariance matrices $A_Y$ and $A_Z$, respectively. When $\theta$ is unknown, we obtain the estimators $\hat\beta_Y(\hat\theta)$ and $\hat\beta_Z(\hat\theta)$ by replacing $\theta$ with the conditional maximum likelihood estimator $\hat\theta$ in the estimating equations $U_1(\theta,\beta_Y)=0$ and $U_2(\theta,\beta_Z)=0$. The errors of both $\hat\beta_Y(\hat\theta)$ and $\hat\beta_Z(\hat\theta)$ can be decomposed into two terms as

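$$\hat\beta_Y(\hat\theta)-\beta_Y=\{\hat\beta_Y(\theta)-\beta_Y\}+\{\hat\beta_Y(\hat\theta)-\hat\beta_Y(\theta)\}$$
$$\hat\beta_Z(\hat\theta)-\beta_Z=\{\hat\beta_Z(\theta)-\beta_Z\}+\{\hat\beta_Z(\hat\theta)-\hat\beta_Z(\theta)\}$$

The first error terms in the above expressions have been discussed, and the second error terms are generated by the use of $\hat\theta$ to estimate $\theta$. The distributions of the two terms are proved to be asymptotically orthogonal to each other. The asymptotic properties of $\hat\beta_Y(\hat\theta)$ and $\hat\beta_Z(\hat\theta)$, together with the proof, are given as follows.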
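The weighted estimating equation $U_2(\theta,\beta_Z)=0$ defined above can be sketched numerically for a single scalar covariate. The example below is an illustration under stated assumptions, not the authors' implementation: the weights $w(y_i,\theta)$ are supplied as a plain list of made-up positive numbers, uncensored times are assumed to have no ties, and the root is found by bisection, which works because the score is non-increasing in $\beta_Z$. The function names and the toy data are hypothetical.

```python
import math


def weighted_partial_score(beta, x, delta, b, w):
    """U_2(beta) = sum over uncensored j of w_j^{-1} * (b_j - weighted risk-set mean of b)."""
    score = 0.0
    for j in range(len(x)):
        if delta[j] != 1:
            continue  # censored observations contribute no score term
        num = 0.0
        den = 0.0
        for i in range(len(x)):
            if x[i] >= x[j]:  # subject i is in the risk set at x[j]
                weight = math.exp(beta * b[i]) / w[i]
                num += weight * b[i]
                den += weight
        score += (b[j] - num / den) / w[j]
    return score


def solve_beta(x, delta, b, w, lo=-10.0, hi=10.0, tol=1e-10):
    """Bisection for the root of the score on the bracket [lo, hi]."""
    if weighted_partial_score(lo, x, delta, b, w) < 0.0 or weighted_partial_score(hi, x, delta, b, w) > 0.0:
        raise ValueError("the root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_partial_score(mid, x, delta, b, w) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Toy data: observed second failure times, event indicators, covariate, made-up weights
x = [2.0, 3.0, 1.0, 5.0, 4.0, 6.0]
delta = [1, 1, 1, 0, 1, 1]
b = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
w = [0.5, 0.8, 0.6, 0.9, 0.7, 1.0]

beta_hat = solve_beta(x, delta, b, w)
print(beta_hat)
```

Setting every weight to 1 recovers the ordinary partial likelihood score, which gives a simple check of the weighting logic.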

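Theorem 3. As $n\to\infty$, the random vectors $\hat\beta_Y(\hat\theta)$ and $\hat\beta_Z(\hat\theta)$ are consistent estimators of $\beta_Y$ and $\beta_Z$, respectively, and $n^{1/2}\{\hat\beta_Y(\hat\theta)-\beta_Y\}$ and $n^{1/2}\{\hat\beta_Z(\hat\theta)-\beta_Z\}$ converge weakly to multivariate normal distributions with mean zero and variance-covariance matrices $A_Y+A_YB_YI_C^{-1}B_Y^{T}A_Y$ and $A_Z+A_ZB_ZI_C^{-1}B_Z^{T}A_Z$, respectively.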

Proof of Theorem 3

Observe that

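$$n^{1/2}\{\hat\beta_Y(\hat\theta)-\beta_Y\}=n^{1/2}\{\hat\beta_Y(\theta)-\beta_Y\}+n^{1/2}\{\hat\beta_Y(\hat\theta)-\hat\beta_Y(\theta)\}\tag{13}$$
$$n^{1/2}\{\hat\beta_Z(\hat\theta)-\beta_Z\}=n^{1/2}\{\hat\beta_Z(\theta)-\beta_Z\}+n^{1/2}\{\hat\beta_Z(\hat\theta)-\hat\beta_Z(\theta)\}\tag{14}$$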

Following the lines in Qi et al. [20], the first terms in eqs (13) and (14) can be proved to have the expressions

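$$n^{1/2}\{\hat\beta_Y(\theta)-\beta_Y\}=n^{-1/2}A_Y\sum_{i=1}^{n}\psi_i(\theta,\beta_Y)+o_p(1)$$
$$n^{1/2}\{\hat\beta_Z(\theta)-\beta_Z\}=n^{-1/2}A_Z\sum_{i=1}^{n}\varphi_i(\theta,\beta_Z)+o_p(1)$$

which converge weakly to multivariate normal distributions with mean zero and variance-covariance matrices $A_Y$ and $A_Z$, respectively. To be specific,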

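$$A_Y=\int\left[\frac{S_Y^{(2)}(\theta,\beta_Y,y)}{S_Y^{(0)}(\theta,\beta_Y,y)}-\frac{S_Y^{(1)}(\theta,\beta_Y,y)^{2}}{S_Y^{(0)}(\theta,\beta_Y,y)^{2}}\right]dF_{mY}(y)$$
$$A_Z=\int\left[\frac{S_Z^{(2)}(\theta,\beta_Z,z)}{S_Z^{(0)}(\theta,\beta_Z,z)}-\frac{S_Z^{(1)}(\theta,\beta_Z,z)^{2}}{S_Z^{(0)}(\theta,\beta_Z,z)^{2}}\right]dF_{mZ}(z)$$
$$S_Y^{(r)}(\theta,\beta_Y,y)=n^{-1}\sum_i\{w(y,\theta)/w(y_i,\theta)\}I(y\le y_i)\exp\{\beta_Y a_i\}a_i^{r}$$
$$S_Z^{(r)}(\theta,\beta_Z,z)=n^{-1}\sum_i\{w(y,\theta)/w(y_i,\theta)\}I\{z\le z_i\}\exp\{\beta_Z b_i\}b_i^{r}$$
$$\psi_i(\theta,\beta_Y)=\int\left\{a_i-\frac{S_Y^{(1)}(\theta,\beta_Y,y)}{S_Y^{(0)}(\theta,\beta_Y,y)}\right\}dM_{iY}(y)$$
$$\varphi_i(\theta,\beta_Z)=\int\left\{b_i-\frac{S_Z^{(1)}(\theta,\beta_Z,z)}{S_Z^{(0)}(\theta,\beta_Z,z)}\right\}dM_{iZ}(z)$$
$$M_{iY}(u)=I(y_i\le u)-\int_0^u\{w(y,\theta)/w(y_i,\theta)\}I(y\le y_i)h_Y(y|a_i)\,dy$$
$$M_{iZ}(u)=I(z_i\le u)-\int_0^u\{w(y,\theta)/w(y_i,\theta)\}I(z\le z_i)h_Z(z|b_i)\,dz$$

where, for a $p\times 1$ vector $\upsilon$, $\upsilon^{r}$ refers to the scalar 1 for $r=0$, $\upsilon$ for $r=1$ and $\upsilon\upsilon^{T}$ for $r=2$, and $F_{mY}$ and $F_{mZ}$ are the cumulative distribution functions of the observed $Y_i$ and $Z_i$, respectively. The variability in each of the second terms in eqs (13) and (14) essentially results from the use of $\hat\theta$ to estimate $\theta$. Let $B_Y=E\{\partial\psi_i(\theta,\beta_Y)/\partial\theta\}$ and $B_Z=E\{\partial\varphi_i(\theta,\beta_Z)/\partial\theta\}$. Under appropriate regularity conditions, the second terms in eqs (13) and (14) can be expressed as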

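$$n^{1/2}\{\hat\beta_Y(\hat\theta)-\hat\beta_Y(\theta)\}=n^{-1/2}A_YB_YI_C^{-1}\sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i|Y_i)+o_p(1)$$
$$n^{1/2}\{\hat\beta_Z(\hat\theta)-\hat\beta_Z(\theta)\}=n^{-1/2}A_ZB_ZI_C^{-1}\sum_{i=1}^{n}\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i|Y_i)+o_p(1)$$

which converge weakly to multivariate normal distributions with mean zero and variance-covariance matrices $A_YB_YI_C^{-1}B_Y^{T}A_Y$ and $A_ZB_ZI_C^{-1}B_Z^{T}A_Z$, respectively. In both eqs (13) and (14), the two error terms are asymptotically orthogonal since

$$E\left\{\psi_i(\theta,\beta_Y)\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i|Y_i)\right\}=E\left[\psi_i(\theta,\beta_Y)\,E\left\{\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i|Y_i)\,\middle|\,Y_i\right\}\right]=0$$
$$E\left\{\varphi_i(\theta,\beta_Z)\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i|Y_i)\right\}=E\left[\varphi_i(\theta,\beta_Z)\,E\left\{\frac{\partial}{\partial\theta}\log p_{T|Y}(T_i|Y_i)\,\middle|\,Y_i\right\}\right]=0$$

These results imply that the quantities in eqs (13) and (14) converge weakly to multivariate normal distributions with mean zero and variance-covariance matrices $A_Y+A_YB_YI_C^{-1}B_Y^{T}A_Y$ and $A_Z+A_ZB_ZI_C^{-1}B_Z^{T}A_Z$, respectively.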

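From the proportional hazards models, the marginal survival functions $S_Y(y)$ and $S_Z(z)$ can be consistently estimated by

$$\hat S_Y(y,\hat\theta\,|\,a)=\hat S_{0Y}(y)^{\exp\{\hat\beta_Y(\hat\theta)a\}}$$

and

$$\hat S_Z(z,\hat\theta\,|\,b)=\hat S_{0Z}(z)^{\exp\{\hat\beta_Z(\hat\theta)b\}}$$

where $\hat S_{0Y}(y)=\exp\{-\hat H_{0Y}(y)\}$ and $\hat S_{0Z}(z)=\exp\{-\hat H_{0Z}(z)\}$ are the Breslow estimators of the baseline survival functions. The baseline cumulative hazard function estimators are

$$\hat H_{0Y}(y)=\sum_{y_j<y}\left[\sum_{i\in R_{j1}}\exp\{\hat\beta_Y(\hat\theta)a_i\}\right]^{-1}$$

and

$$\hat H_{0Z}(z)=\sum_{x_{(j)}<z}d_{j2}\left[\sum_{i\in R_{j2}}\exp\{\hat\beta_Z(\hat\theta)b_i\}\right]^{-1}$$

where $R_{j1}$ is the risk set for $Y$ at $y_j$, and $d_{j2}$ and $R_{j2}$ are the failure event set and risk set for $Z$ at $x_{(j)}$, as previously defined.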
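The Breslow-type formulas above translate directly into a short computation. The sketch below follows the displayed expression for $\hat H_{0Z}(z)$ with a single scalar covariate, takes $d_{j2}$ to be the number of events at $x_{(j)}$, and treats the regression coefficient as a given made-up number; the function names and the toy data are hypothetical.

```python
import math


def breslow_cumulative_hazard(z, x, delta, b, beta):
    """Sum over distinct event times x_(j) < z of d_j / sum_{i in R_j} exp(beta * b_i)."""
    event_times = sorted({x[i] for i in range(len(x)) if delta[i] == 1})
    hazard = 0.0
    for t in event_times:
        if t >= z:
            break  # only event times strictly before z contribute
        d_j = sum(1 for i in range(len(x)) if delta[i] == 1 and x[i] == t)
        risk_sum = sum(math.exp(beta * b[i]) for i in range(len(x)) if x[i] >= t)
        hazard += d_j / risk_sum
    return hazard


def survival_given_covariate(z, covariate, x, delta, b, beta):
    """S(z | covariate) = S_0(z) ** exp(beta * covariate), with S_0 = exp(-H_0)."""
    baseline = math.exp(-breslow_cumulative_hazard(z, x, delta, b, beta))
    return baseline ** math.exp(beta * covariate)


# Toy data: observed times, event indicators, covariate; beta is a made-up value
x = [1.0, 2.0, 3.0, 4.0]
delta = [1, 1, 0, 1]
b = [0.0, 1.0, 0.0, 1.0]

print(survival_given_covariate(4.5, 1.0, x, delta, b, beta=0.5))
```

With the coefficient set to zero the cumulative hazard reduces to the Nelson-Aalen estimator, and with a positive coefficient a larger covariate value gives a lower estimated survival probability.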

References

1. Zhu H, Wang M-C. Analyzing bivariate survival data with interval sampling and application to cancer epidemiology. Biometrika 2012;99:345-61. doi:10.1093/biomet/ass009

2. Bilker WB, Wang M-C. A semiparametric extension of the Mann-Whitney test for randomly truncated data. Biometrics 1996;52:10-20. doi:10.2307/2533140

3. Brookmeyer R. AIDS, epidemics, and statistics. Biometrics 1996;52:781-96. doi:10.2307/2533042

4. Lutalo T, Gray RH, Wawer M, Sewankambo N, Serwadda D, Laeyendecker O, et al. Survival of HIV-infected treatment-naive individuals with documented dates of seroconversion in Rakai, Uganda. AIDS 2007;21:S15-9. doi:10.1097/01.aids.0000299406.44775.de

5. Kaleebu P, Ross A, Morgan D, Yirrel D, Oram J, Rutebemberwa A, et al. Relationship between HIV-1 ENV subtypes A and D and disease progression in a rural Ugandan cohort. AIDS 2001;15:293-9. doi:10.1097/00002030-200102160-00001

6. Huang Y, Louis TA. Nonparametric estimation of the joint distribution of survival time and mark variable. Biometrika 1998;85:785-96. doi:10.1093/biomet/85.4.785

7. Lin D-Y, Sun W, Ying Z. Nonparametric estimation of gap time distributions for serial events with censored data. Biometrika 1999;86:59-70. doi:10.1093/biomet/86.1.59

8. Schaubel DE, Cai J. Nonparametric estimation of gap time survival functions for ordered multivariate failure time data. Stat Med 2004;23:1885-900. doi:10.1002/sim.1777

9. Visser M. Nonparametric estimation on the bivariate survival function with application to vertically transmitted AIDS. Biometrika 1996;83:507-18. doi:10.1093/biomet/83.3.507

10. Shih JH, Louis TA. Inferences on the association parameters in copula models for bivariate survival data. Biometrics 1995;51:1384-99. doi:10.2307/2533269

11. Wang W. Estimating the association parameter for copula models under dependent censoring. J R Stat Soc B 2003;65:257-73. doi:10.1111/1467-9868.00385

12. Lakhal-Chaieb L, Cook R, Lin X. Inverse probability of censoring weighted estimates of Kendall’s τ for gap time analyses. Biometrics 2010;66:1145-52. doi:10.1111/j.1541-0420.2010.01404.x

13. Genest C, Ghoudi K, Rivest L-P. A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika 1995;82:543-52. doi:10.1093/biomet/82.3.543

14. Wang W, Ding AA. On assessing the association for bivariate current status data. Biometrika 2000;87:879-93. doi:10.1093/biomet/87.4.879

15. Aalen OO. Weak convergence of stochastic integrals related to counting process. Z Wahrsch Ver Geb 1977;38:261-77. doi:10.1007/BF00533158

16. Zeng D. Estimating marginal survival function by adjusting for dependent censoring using many covariates. Ann Stat 2004;32:1533-55. doi:10.1214/009053604000000508

17. Shen P-S. Nonparametric analysis of doubly truncated data. Ann Inst Stat Math 2008;62:835-53. doi:10.1007/s10463-008-0192-2

18. Betensky RA, Finkelstein DM. A non-parametric maximum likelihood estimator for bivariate interval censored data. Stat Med 1999;18:3089-100. doi:10.1002/(SICI)1097-0258(19991130)18:22<3089::AID-SIM191>3.0.CO;2-0

19. Van der Vaart AW. Asymptotic statistics. Cambridge: Cambridge University Press, 1998.

20. Qi L, Wang CY, Prentice RL. Weighted estimator for proportional hazards regression with missing covariates. J Am Stat Assoc 2005;100:1250-63. doi:10.1198/016214505000000295

Published Online: 2015-1-28
Published in Print: 2015-5-1

© 2015 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 26.4.2024 from https://www.degruyter.com/document/doi/10.1515/ijb-2013-0060/html