
About this Book

Applied Survival Analysis Using R covers the main principles of survival analysis, gives examples of how it is applied, and teaches how to put those principles to use in analyzing data, with R as a vehicle. Survival data, where the primary outcome is time to a specific event, arise in many areas of biomedical research, including clinical trials, epidemiological studies, and studies of animals. Many survival methods are extensions of techniques used in linear regression and categorical data analysis, while other aspects of this field are unique to survival data. This text employs numerous actual examples to illustrate survival curve estimation, comparison of survivals of different groups, proper accounting for censoring and truncation, model variable selection, and residual analysis.
Because explaining survival analysis requires more advanced mathematics than many other statistical topics, this book is organized with basic concepts and most frequently used procedures covered in earlier chapters, with more advanced topics near the end and in the appendices. A background in basic linear regression and categorical data analysis, as well as a basic knowledge of calculus and the R system, will help the reader to fully appreciate the information presented. Examples are simple and straightforward while still illustrating key points, shedding light on the application of survival analysis in a way that is useful for graduate students, researchers, and practitioners in biostatistics.

Table of Contents

Frontmatter

Chapter 1. Introduction

Abstract
Survival analysis is the study of survival times and of the factors that influence them. Types of studies with survival outcomes include clinical trials, prospective and retrospective observational studies, and animal experiments. Examples of survival times include time from birth until death, time from entry into a clinical trial until death or disease progression, or time from birth to development of breast cancer (that is, age of onset). The survival endpoint can also refer to a positive event. For example, one might be interested in the time from entry into a clinical trial until tumor response. Survival studies can involve estimation of the survival distribution, comparisons of the survival distributions of various treatments or interventions, or elucidation of the factors that influence survival times. As we shall see, many of the techniques we study have analogues in generalized linear models such as linear or logistic regression.
Dirk F. Moore

Chapter 2. Basic Principles of Survival Analysis

Abstract
Survival analysis methods depend on the survival distribution, and two key ways of specifying it are the survival function and the hazard function. The survival function defines the probability of surviving up to a point t. Formally,
$$S(t) = \Pr(T > t), \quad 0 < t < \infty$$
This function takes the value 1 at time 0, decreases (or remains constant) over time, and of course never drops below 0. As defined here it is right continuous.
Dirk F. Moore
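
As a small illustration of this definition (a sketch, not an example from the book), the following R code plots \(S(t)\) for an exponential distribution; the rate of 0.5 is an arbitrary choice.

```r
# Minimal sketch: the survival function S(t) = Pr(T > t) for an
# exponential distribution with an arbitrary rate of 0.5.
t <- seq(0, 10, length.out = 200)
surv <- pexp(t, rate = 0.5, lower.tail = FALSE)  # S(t) = 1 - F(t)
plot(t, surv, type = "l", ylim = c(0, 1),
     xlab = "Time", ylab = "S(t)")
```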

Chapter 3. Nonparametric Survival Curve Estimation

Abstract
We have seen that there are a wide variety of hazard function shapes to choose from if one models survival data using a parametric model. But which parametric model should one use for a particular application? When modeling human or animal survival, it is hard to know what parametric family to choose, and often none of the available families has sufficient flexibility to model the actual shape of the distribution. Thus, in medical and health applications, nonparametric methods, which have the flexibility to account for the vagaries of the survival of living things, have considerable advantages. In this chapter we will discuss nonparametric estimators of the survival function. The most widely used of these is the product-limit estimator, also known as the Kaplan-Meier estimator. This estimator, first proposed by Kaplan and Meier [35], is the product over the failure times of the conditional probabilities of surviving to the next failure time. Formally, it is given by
$$\hat{S}(t) = \prod_{t_{i}\leq t}\left(1 - \hat{q}_{i}\right) = \prod_{t_{i}\leq t}\left(1 - \frac{d_{i}}{n_{i}}\right)$$
where \(n_{i}\) is the number of subjects at risk at time \(t_{i}\), and \(d_{i}\) is the number of individuals who fail at that time. The example data in Table 1.1 may be used to illustrate the construction of the Kaplan-Meier estimate, as shown in Table 3.1.
Table 3.1 Kaplan-Meier estimate

t_i    n_i    d_i    q_i      1 - q_i    S_i = ∏(1 - q_i)
2      6      1      0.167    0.833      0.833
4      5      1      0.200    0.800      0.667
6      3      1      0.333    0.667      0.444
Dirk F. Moore
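
To make this concrete, here is a short R sketch (not taken from the book) that reproduces the product-limit calculation with the survival package. The six observations are hypothetical, chosen only to be consistent with the at-risk counts \(n_{i} = 6, 5, 3\) in Table 3.1.

```r
# Minimal sketch: Kaplan-Meier estimation with the survival package.
# The data are invented but consistent with Table 3.1's risk sets.
library(survival)
tt   <- c(2, 4, 5, 6, 7, 8)   # observed times (5 and 8 are censored)
stat <- c(1, 1, 0, 1, 1, 0)   # 1 = death, 0 = censored
fit <- survfit(Surv(tt, stat) ~ 1)
summary(fit)  # time, n.risk, n.event, survival: matches the table's rows
plot(fit, xlab = "Time", ylab = "Survival probability")
```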

Chapter 4. Nonparametric Comparison of Survival Distributions

Abstract
Testing the equivalence of two groups is a familiar problem in statistics. Typically we are interested in testing a null hypothesis that two population means are equal versus an alternative that the means are not equal (for a two-sided test) or that the mean for an experimental treatment is greater than that for a standard treatment (one-sided test). We compute a test statistic from the observed data, and reject the null hypothesis if the test statistic exceeds a particular constant. The significance level of the test is the probability that we reject the null hypothesis when the null hypothesis is in fact true. A widely known test is the two-sample Student's t-test for continuous observations, which requires the assumption that the observations are normally distributed. If the normal distribution assumption is in doubt, a rank-based test called the Mann-Whitney test may be used, which gives valid test results without making parametric assumptions.

With survival data, if we are willing to assume that the data follow a particular parametric distribution, we can use likelihood theory to construct a test for equivalence of the two distributions, as we shall see in Chap. 10. However, as we have discussed in the previous chapters, survival data from biomedical experiments or clinical trials generally do not lend themselves to analysis by parametric methods. Thus, we shall construct nonparametric tests of equivalence of two survival functions, \(H_{0}: S_{1}(t) = S_{0}(t)\). Typically, \(S_{1}\) and \(S_{0}\) will represent the survival distributions for, respectively, an experimental and a control therapy. Now, a statistical hypothesis test (in the classical hypothesis testing framework) also requires us to specify an alternative hypothesis, and one might at first try to specify a one-sided alternative \(H_{A}: S_{1}(t) > S_{0}(t)\) or a two-sided alternative \(H_{A}: S_{1}(t) \neq S_{0}(t)\). Unfortunately, things aren't so simple in survival analysis, since the alternative can take a wide range of forms. What if the survival distributions are similar for some values of t and differ for others? What if the survival distributions cross? How do we want our test statistic to behave under these different scenarios? One solution is to consider what is called a Lehmann alternative, \(H_{A}: S_{1}(t) = \left[S_{0}(t)\right]^{\psi}\). Equivalently, we can view Lehmann alternatives in terms of proportional hazards as \(h_{1}(t) = \psi h_{0}(t)\). Either way we would construct a one-sided test as \(H_{0}: \psi = 1\) versus \(H_{A}: \psi < 1\), so that under the alternative hypothesis \(S_{1}(t)\) will be uniformly higher than \(S_{0}(t)\) and \(h_{1}(t)\) uniformly lower than \(h_{0}(t)\) (i.e., subjects in Group 1 will have longer survival times than subjects in Group 0).

As we shall see, we can construct a test statistic using the ranks of the survival times. While these rank-based tests are similar to the Mann-Whitney test, the presence of censoring complicates the assignment of ranks. Thus, we initially take an alternative approach to developing this test, where we view the numbers of failures and numbers at risk at each distinct time as a two-by-two table. That is, for each failure time \(t_{i}\) we may construct a two-by-two table showing the numbers at risk (\(n_{0i}\) and \(n_{1i}\) for the control and treatment arms, respectively) and the numbers of failures (\(d_{0i}\) and \(d_{1i}\), respectively). Also shown in the table are the "marginals", that is, the row and column sums. For example, we have \(d_{i} = d_{0i} + d_{1i}\) and \(n_{i} = n_{0i} + n_{1i}\). We first order the distinct failure times. Then for the i'th failure time, we have the following table:

             Failures   At risk
Control      d_0i       n_0i
Treatment    d_1i       n_1i
Total        d_i        n_i
Dirk F. Moore
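
The log-rank test aggregates these tables across the failure times; a minimal R sketch of such a comparison, on data invented for illustration:

```r
# Minimal sketch: log-rank test of H0: S1(t) = S0(t) on invented data.
library(survival)
tt    <- c(6, 7, 10, 15, 19, 25, 4, 6, 9, 11, 12, 18)
stat  <- c(1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1)  # 1 = event, 0 = censored
group <- rep(c("treatment", "control"), each = 6)
survdiff(Surv(tt, stat) ~ group)  # chi-square statistic and p-value
```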

Chapter 5. Regression Analysis Using the Proportional Hazards Model

Abstract
In the previous chapter we saw how to compare two survival distributions without assuming a particular parametric form for the survival distributions, and we also introduced a parameter \(\psi\) that indexes the difference between the two survival distributions via the Lehmann alternative, \(S_{1}(t) = \left[S_{0}(t)\right]^{\psi}\). Using Eq. 2.2.1 we can see that we can re-express this relationship in terms of the hazard functions, yielding the proportional hazards assumption,
$$h_{1}(t) = \psi h_{0}(t). \tag{5.1.1}$$
This equation is the key to quantifying the difference between two hazard functions, and the proportional hazards model is widely used. (Later we will see how to assess the validity of this assumption, and ways to relax it when necessary.) Furthermore, we can extend the model to include covariate information in a vector z as follows:
$$\psi = e^{z\beta}. \tag{5.1.2}$$
While other functional relationships between the proportional hazards constant \(\psi\) and the covariates z are possible, this is by far the most common in practice. This proportional hazards model will allow us to fit regression models to censored survival data, much as one can do in linear and logistic regression. However, not assuming a particular parametric form for \(h_{0}(t)\), along with the presence of censoring, makes survival modeling particularly complicated. In this chapter we shall see how to do this using what we shall call a partial likelihood. This modification of the standard likelihood was developed initially by D.R. Cox [12], and hence the model is often referred to as the Cox proportional hazards model.
Dirk F. Moore
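
A brief sketch of where this leads in practice, using the lung dataset that ships with the survival package (not one of the book's own examples):

```r
# Minimal sketch: fitting a Cox proportional hazards model in R.
library(survival)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)  # coef = log hazard ratio; exp(coef) plays the role of psi
```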

Chapter 6. Model Selection and Interpretation

Abstract
Survival analysis studies typically include a wealth of clinical, demographic, and biomarker information on the patients as well as indicators for a therapy or other intervention. If the study is a randomized clinical trial, the focus will be on comparing the effectiveness of different treatments. A successful randomization procedure should ensure that confounding covariates are balanced between the treatments. Still, we may wish to include such covariates in the model to adjust for any differences that may have arisen, and also to understand how these other factors affect survival. If the study is based on observational data, and if there is a primary intervention of interest, then adjustment for potential confounders is essential to obtaining a valid estimate of the intervention effect. The effect of other covariates on survival will also be of interest in such a study, and in some applications discovery and quantification of explanatory variables may be the primary goal. Regardless of the type of study, we will need methods to sift through a potentially large number of candidate explanatory variables to find the important ones.
Dirk F. Moore
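
One common way to sift through candidates, sketched here under the assumption that AIC-based stepwise selection suits the problem at hand (the lung data are used purely for illustration):

```r
# Minimal sketch: AIC-based backward selection on a Cox model.
library(survival)
vars <- c("time", "status", "age", "sex", "ph.ecog", "wt.loss")
dat  <- na.omit(lung[, vars])   # step() needs a fixed data set
full <- coxph(Surv(time, status) ~ age + sex + ph.ecog + wt.loss,
              data = dat)
step(full)  # drops terms one at a time to minimize AIC
```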

Chapter 7. Model Diagnostics

Abstract
The use of residuals for model checking has been well developed in linear regression theory (see for example Weisberg, 2014 [77]). The residuals are plotted versus some quantity, such as a covariate value, and the observed pattern is used to diagnose possible problems with the fitted model. Some residuals have the additional property of not only indicating problems but also suggesting remedies. That is, the pattern of the plotted residuals may suggest an alternative model that fits the data better. Many of these residuals have been generalized to survival analysis. In addition, the fact that survival data evolve over time, and require special assumptions such as proportional hazards, makes it necessary to develop additional residual-based diagnostic methods.
Dirk F. Moore
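
A sketch of two such diagnostics in R (the lung data again serve only as an illustration):

```r
# Minimal sketch: two standard Cox model diagnostics.
library(survival)
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
res <- residuals(fit, type = "martingale")  # functional-form check
plot(lung$age, res, xlab = "Age", ylab = "Martingale residual")
cox.zph(fit)  # Schoenfeld-residual test of proportional hazards
```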

Chapter 8. Time Dependent Covariates

Abstract
The partial likelihood theory for survival data, introduced in Chap. 5, allows one to model survival times while accommodating covariate information. An important caveat to this theory is that the values of the covariates must be determined at time t = 0, when the patient enters the study, and remain constant thereafter. This issue arises with survival data because such data evolve over time, and it would be improper to use the value of a covariate to model survival information that is observed before the covariate's value is known. To accommodate covariates that may change their value over time ("time dependent covariates"), special measures are necessary to obtain valid parameter estimates. An intervention that occurs after the start of the trial, or a covariate (such as air pollution exposure) that changes values over the course of the study, are two examples of time dependent variables.
Dirk F. Moore
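
A sketch of the standard device for this in R, the (start, stop] counting-process data format, with a small invented data set in which some subjects switch from tx = 0 to tx = 1 partway through follow-up:

```r
# Minimal sketch: a time-dependent treatment indicator coded with
# (start, stop] intervals; data are invented for illustration.
library(survival)
td <- data.frame(
  id    = c(1, 1, 2, 3, 3, 4, 5, 5, 6),
  start = c(0, 30, 0, 0, 20, 0, 0, 10, 0),
  stop  = c(30, 80, 55, 20, 70, 40, 10, 90, 85),
  event = c(0, 1, 1, 0, 0, 1, 0, 1, 0),  # event only in a final interval
  tx    = c(0, 1, 0, 0, 1, 0, 0, 1, 0)   # 1 after the switch
)
coxph(Surv(start, stop, event) ~ tx, data = td)
```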

Chapter 9. Multiple Survival Outcomes and Competing Risks

Abstract
Until now the type of survival data we have considered has, as an endpoint, a single cause of death, and the survival times of each case have been assumed to be independent. Methods for analyzing such survival data will not be sufficient if cases are not independent or if the event is something that can occur repeatedly. An example of the first type would be clustered data. For instance, one might be interested in survival times of individuals that are in the same family or in the same unit, such as a town or school. In this case, genetic or environmental factors mean that survival times within a cluster are more similar to each other than to those from other clusters, so that the independence assumption no longer holds. In the second case, if the event of interest is, for example, the occurrence of a seizure, the event may repeat indefinitely. Then we would have multiple times per person. Special methods are needed to handle these types of data structures, which we shall discuss in Sect. 9.1. A different situation arises when only the first of several outcomes is observable, a topic we will take up in Sect. 9.2.
Dirk F. Moore
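
For the clustered case, a hedged sketch of two standard devices available in coxph (the family data below are invented):

```r
# Minimal sketch: clustered survival times (invented family data).
library(survival)
fam <- data.frame(
  time   = c(5, 8, 12, 3, 9, 15, 7, 11),
  status = c(1, 1, 0, 1, 1, 0, 1, 1),
  trt    = c(0, 1, 0, 1, 0, 1, 0, 1),
  family = c(1, 1, 1, 2, 2, 2, 3, 3)
)
# Marginal model with robust (sandwich) standard errors per cluster:
coxph(Surv(time, status) ~ trt + cluster(family), data = fam)
# Shared gamma frailty model, one random effect per family:
coxph(Surv(time, status) ~ trt + frailty(family), data = fam)
```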

Chapter 10. Parametric Models

Abstract
In biomedical applications, non-parametric (e.g. the product-limit survival curve estimator) and semi-parametric (e.g. the Cox proportional hazards model) methods play the most important role, since they have the flexibility to accommodate a wide range of hazard function forms. Still, parametric methods have a place in biomedical research, and may be appropriate when survival data can be shown to approximately follow a particular parametric form. Parametric models are often much easier to work with than the partial-likelihood-based models we have discussed in earlier chapters, since the former are defined by a small and fixed number of unknown parameters. This allows us to use standard likelihood theory for parameter estimation and inference. Furthermore, accommodating complex censoring and truncation patterns is much more direct with parametric models than with partial likelihood models. Of course, the validity of these techniques depends heavily on the appropriateness of the particular parametric model being used. In Chap. 2 we introduced the exponential, Weibull, and gamma distributions, and mentioned several others that could potentially serve as survival distribution models. In this chapter we will emphasize the exponential and Weibull distributions, since these are the most commonly used parametric distributions. We will also briefly discuss the use of some other parametric models in analyzing survival data.
Dirk F. Moore
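
A minimal sketch of a parametric fit in R, here a Weibull regression via survreg (the lung data are an illustrative choice, not the book's example):

```r
# Minimal sketch: Weibull regression with survreg (accelerated
# failure time parameterization).
library(survival)
wfit <- survreg(Surv(time, status) ~ age + sex, data = lung,
                dist = "weibull")
summary(wfit)
```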

Chapter 11. Sample Size Determination for Survival Studies

Abstract
Deciding how many subjects to include in a randomized clinical trial is a key component of its design. In the classical hypothesis testing framework, for any type of outcome, one must specify the effect one is aiming for, the inherent variability in the test statistic, the significance level of the test, and the desired power of the test to detect that effect. In survival analysis, there are additional factors that one must specify regarding the censoring mechanism and the particular survival distributions in the null and alternative hypotheses. First, one needs either to specify what parametric survival model one is using, or that the test will be semi-parametric, e.g., the log-rank test. This allows for determining the number of deaths (or events) required to meet the power and other design specifications. Second, one must, for administrative reasons, provide an estimate of the number of patients that need to be entered into the trial to produce the required number of deaths. We shall assume that the clinical trial is run as described in Chap. 1, where patients enter a trial over a certain accrual period of length a, and are then followed for an additional period of time f known as the follow-up time. Patients still alive at the end of follow-up are censored. We will describe sample size methods for single arm clinical trials and then for two arm clinical trials.
Dirk F. Moore
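
For the first step, the required number of deaths, one widely used approximation for a two-arm log-rank design with 1:1 randomization is Schoenfeld's formula; a sketch with a small helper function written for this illustration, not taken from the book:

```r
# Minimal sketch: deaths required for a two-arm log-rank test with
# 1:1 randomization, via Schoenfeld's approximation
#   d = 4 * (z[1 - alpha/2] + z[power])^2 / (log hr)^2
deaths_required <- function(hr, alpha = 0.05, power = 0.80) {
  z_a <- qnorm(1 - alpha / 2)
  z_b <- qnorm(power)
  4 * (z_a + z_b)^2 / log(hr)^2
}
deaths_required(hr = 0.67)  # roughly 196 deaths for 80% power
```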

Chapter 12. Additional Topics

Abstract
The exponential distribution, with its constant hazard assumption, is too inflexible to be useful in most lifetime data applications. The piecewise exponential model, by contrast, is a generalization of the exponential which can offer considerable flexibility for modeling. In Chap. 2 (Exercise 2.5) we saw a simple piecewise exponential model with two “pieces”. That is, the survival time axis was divided into two intervals, with a constant hazard on each interval. Here we show how to generalize this model to accommodate multiple intervals on which the hazard is constant. An important feature of the piecewise exponential is that the likelihood is equivalent to a Poisson likelihood. Thus, we can use a Poisson model-fitting function in R to find maximum likelihood estimates of the hazard function and of parameters of a proportional hazards model.
Dirk F. Moore
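
A sketch of that Poisson equivalence in R, using survSplit to divide follow-up into intervals and glm with a log person-time offset (the lung data and the cutpoints are illustrative choices, not the book's example):

```r
# Minimal sketch: piecewise exponential model fit as a Poisson GLM.
library(survival)
lung2 <- lung
lung2$status <- lung2$status - 1          # recode 1/2 coding to 0/1
ps <- survSplit(Surv(time, status) ~ age + sex, data = lung2,
                cut = c(180, 360, 540), episode = "interval",
                start = "tstart")
ps$exposure <- ps$time - ps$tstart        # person-time in each interval
glm(status ~ factor(interval) + age + sex + offset(log(exposure)),
    family = poisson, data = ps)
```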

Erratum

Dirk F. Moore

Backmatter
