2.1 The discrete-time survival likelihood
Consider an individual described by its covariate vector
\({\mathbf {x}}\in {\mathbb {R}}^q\). Assume that time is discrete with values
\(0 = \tau _0 < \tau _1 < \ldots \), and let
\({\mathbb {T}} = \{\tau _1, \tau _2, \dots \}\) denote the set of positive
\(\tau _j\)’s. The time of an event is denoted
\(T^* \in {\mathbb {T}}\), and our goal is to model the conditional distribution of this event time given the covariate vector
\({\mathbf {x}}\). The probability mass function (PMF) and the survival function for the event time are defined as
$$\begin{aligned} f(\tau _j \,|\,{\mathbf {x}})&= \text {P}(T^* = \tau _j \,|\,{\mathbf {x}}),\nonumber \\ S(\tau _j \,|\,{\mathbf {x}})&= \text {P}(T^* > \tau _j \,|\,{\mathbf {x}}) = \sum _{k > j} f(\tau _k \,|\,{\mathbf {x}}). \end{aligned}$$
(1)
In survival analysis, models are often expressed in terms of the hazard rate rather than the PMF. For discrete time, the hazard rate is defined as
$$\begin{aligned} h(\tau _j \,|\,{\mathbf {x}})&= \text {P}(T^* = \tau _j \,|\,T^* > \tau _{j-1}, {\mathbf {x}}) = \frac{f(\tau _j \,|\,{\mathbf {x}})}{S(\tau _{j-1} \,|\,{\mathbf {x}})} , \end{aligned}$$
and it follows that
$$\begin{aligned} f(\tau _j \,|\,{\mathbf {x}})&= h(\tau _j \,|\,{\mathbf {x}})\, S(\tau _{j-1} \,|\,{\mathbf {x}}), \end{aligned}$$
(2)
$$\begin{aligned} S(\tau _j \,|\,{\mathbf {x}})&= [1 - h(\tau _j \,|\,{\mathbf {x}})]\, S(\tau _{j-1} \,|\,{\mathbf {x}}). \end{aligned}$$
(3)
Note further that from (3) it follows that the survival function can be expressed as
$$\begin{aligned} S(\tau _j \,|\,{\mathbf {x}})&= \prod _{k=1}^j [1 - h(\tau _k \,|\,{\mathbf {x}})]. \end{aligned}$$
(4)
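To make the recursions concrete, the following minimal sketch (our own, in PyTorch, with made-up hazard values) computes the survival function and PMF from a vector of discrete hazards using (2)–(4):

```python
import torch

# Hypothetical discrete hazards h(tau_1 | x), ..., h(tau_m | x) for one individual.
h = torch.tensor([0.05, 0.10, 0.20, 0.15])

# Survival function (4): S(tau_j | x) = prod_{k <= j} [1 - h(tau_k | x)].
surv = torch.cumprod(1.0 - h, dim=0)

# PMF (2): f(tau_j | x) = h(tau_j | x) * S(tau_{j-1} | x), with S(tau_0 | x) = 1.
surv_prev = torch.cat([torch.ones(1), surv[:-1]])
pmf = h * surv_prev

# Consistency with (3): S(tau_j | x) = [1 - h(tau_j | x)] * S(tau_{j-1} | x).
assert torch.allclose(surv, (1.0 - h) * surv_prev)
```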
In most studies, we do not observe all event times. For some individuals, we only have a right-censored observation. To allow for censoring, we let
\(C^* \in {\mathbb {T}}_C = \{\tau _1, \tau _2, \ldots , \tau _m\}\) be a right-censoring time. Here
\(\tau _m\) defines the maximum follow-up time, at which all individuals still at risk are administratively censored. The random variables
\(T^*\) and
\(C^*\) are typically not observed directly, but instead we observe a potentially right-censored event time
\(T = \min \{T^*,\, C^*\}\) and an event indicator
\(D = \mathbbm {1}\{T^* \le C^*\}\). Here we follow the common convention in survival analysis that, when an event time and a censoring time coincide, the event is observed. Note that, as
\(C^* \le \tau _m\), we cannot observe event times
\(T^*\) larger than
\(\tau _m\). Hence, we are restricted to modeling the distribution of the event times in
\({\mathbb {T}}_C\).
We assume that \(T^*\) and \(C^*\) are conditionally independent given \({\mathbf {x}}\), and that their distributions have no parameters in common. Then we can consider, separately, the contribution to the likelihood of the event time distribution and the censoring distribution. We are, however, typically only interested in modeling the event time distribution.
Now, considering a set of
n independent individuals, each with covariates
\({\mathbf {x}}_i\), event or censoring time
\(t_i\), and event indicator
\(d_i\), the likelihood contribution of each individual
i is given by
$$\begin{aligned} L_i&= {f(t_i \,|\,{\mathbf {x}}_i)}^{d_i} {S(t_i \,|\,{\mathbf {x}}_i)}^{1-d_i}. \end{aligned}$$
(5)
Using this, we can fit models by minimizing the mean negative log-likelihood
$$\begin{aligned} \text {loss}&= - \frac{1}{n} \sum _{i=1}^n \big \{ d_i \log [f(t_i \,|\,{\mathbf {x}}_i)] + (1-d_i) \log [S(t_i \,|\,{\mathbf {x}}_i)] \big \}. \end{aligned}$$
(6)
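As a direct (not numerically hardened) transcription of (6), assuming the PMF and survival function are available as tensors, the loss can be sketched as follows; the function name and argument layout are our own:

```python
import torch

def neg_log_likelihood(pmf, surv, idx, events):
    """Mean negative log-likelihood (6).

    pmf, surv: (n, m) tensors holding f(tau_j | x_i) and S(tau_j | x_i).
    idx:       (n,) long tensor with the 0-based index of t_i among tau_1, ..., tau_m.
    events:    (n,) float tensor with the event indicators d_i.
    """
    f_t = pmf.gather(1, idx.unsqueeze(1)).squeeze(1)   # f(t_i | x_i)
    s_t = surv.gather(1, idx.unsqueeze(1)).squeeze(1)  # S(t_i | x_i)
    loglik = events * torch.log(f_t) + (1.0 - events) * torch.log(s_t)
    return -loglik.mean()
```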
A useful reformulation of the loss function (6) is obtained by rewriting it in terms of the discrete hazards. To this end, let
\(\kappa (t) \in \{0, \ldots , m\}\) define the index of the discrete time
t, meaning
\(t = \tau _{\kappa (t)}\). Using (2), (3), and (4), we can then rewrite the likelihood contribution (5) as
$$\begin{aligned} L_i = {h(t_i \,|\,{\mathbf {x}}_i)}^{d_i} \, {[1 - h(t_i \,|\,{\mathbf {x}}_i)]}^{1-d_i} \, \prod _{j=1}^{{\kappa (t_i)}-1} [1 - h(\tau _j \,|\,{\mathbf {x}}_i)]. \end{aligned}$$
With this formulation, the mean negative log-likelihood in (6) can be rewritten as
$$\begin{aligned} \text {loss} = - \frac{1}{n} \sum _{i=1}^n \sum _{j=1}^{\kappa (t_i)}\big \{y_{ij} \log [h(\tau _{j} \,|\,{\mathbf {x}}_i)] + (1 - y_{ij}) \log [1 - h(\tau _{j} \,|\,{\mathbf {x}}_i)] \big \}. \end{aligned}$$
(7)
Here,
\(y_{ij} = \mathbbm {1}\{t_i = \tau _j,\, d_i = 1\}\), so
\(\mathbf{y}_i = (y_{i1}, \ldots , y_{i{\kappa (t_i)}})\) is a vector of zeros with a single 1 at the event index
\(\kappa (t_i)\) when
\(t_i\) corresponds to an observed event (\(d_i = 1\)). We recognize this as the negative log-likelihood for Bernoulli data, or binary cross-entropy, a useful connection first noted by Brown (1975).
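Concretely, (7) is a masked binary cross-entropy: each individual contributes Bernoulli terms only for the intervals up to and including \(\kappa (t_i)\). A minimal sketch (our own naming and tensor layout) could look as follows:

```python
import torch
import torch.nn.functional as F

def hazard_nll(hazards, idx, events):
    """Mean negative log-likelihood (7), given discrete hazards.

    hazards: (n, m) tensor with h(tau_j | x_i) in (0, 1).
    idx:     (n,) long tensor with the 0-based index of t_i, i.e. kappa(t_i) - 1.
    events:  (n,) float tensor with the event indicators d_i.
    """
    n, m = hazards.shape
    # y_ij = 1{t_i = tau_j, d_i = 1}: at most a single 1, at the event index.
    y = torch.zeros_like(hazards)
    y[torch.arange(n), idx] = events
    # Only the terms j = 1, ..., kappa(t_i) contribute for individual i.
    mask = (torch.arange(m).unsqueeze(0) <= idx.unsqueeze(1)).float()
    bce = F.binary_cross_entropy(hazards, y, reduction="none")
    return (bce * mask).sum(dim=1).mean()
```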
With the two loss functions (6) and (7), we can now construct survival models by parameterizing the PMF or the discrete hazard rate and minimizing the corresponding loss. For classical statistical models, these approaches are equivalent and have been used to obtain maximum likelihood estimates for the parameters in the PMF/hazard rate; see Tutz and Schmid (2016) for a review. We will, however, not consider classical maximum likelihood estimation, but focus on the part of the literature that fits neural networks for the purpose of time-to-event prediction, in which case the two loss functions may give different results.
2.2 Parameterization with neural networks
A neural network
\(\phi ({\mathbf {x}}) \in {\mathbb {R}}^m\) is a parametric, differentiable function of a covariate vector
\({\mathbf {x}}\in {\mathbb {R}}^q\), whose parameters are fitted by minimizing a loss function with a gradient descent approach. While networks typically contain thousands or millions of parameters, simple models such as linear and logistic regression can also be considered neural networks. For a large number of parameters, we are usually not interested in the parameter estimates themselves, but only in the network’s predictive capabilities. While there is a vast literature on various ways to parameterize neural networks, the internal structure of the networks is not that relevant for this paper, as we only consider standard multilayer perceptrons (MLPs). So, for the purposes of this paper, we think of the network
\(\phi ({\mathbf {x}}) \in {\mathbb {R}}^m\) as a very flexible parametric function of the covariates
\({\mathbf {x}}\). For more on MLPs and neural networks in general see, e.g., the book by Goodfellow et al. (2016).
In the previous subsection, we saw that the survival likelihood can be expressed in terms of the PMF or the hazard rate. In the following, we describe how to use this to create survival prediction methods by parameterizing the PMF or the hazard with neural networks. In theory, as both approaches aim at minimizing the same negative log-likelihood, the methods should yield the same results. But due to the nature of neural networks, this might not be the case in practice. Contrary to most parametric statistical models, neural networks are typically overparameterized, and the training loss is not minimized to convergence. Instead, a held-out validation set is monitored, and the iterative optimization procedure is stopped when performance on this validation set starts to deteriorate. Also, as neural networks are well known to be sensitive to numerical instabilities, some parameterizations of a likelihood might result in better performance than others.
First, considering the hazard parametrization of the likelihood, let
\(\phi ({\mathbf {x}}) \in {\mathbb {R}}^m\) represent a neural network that takes the covariates
\({\mathbf {x}}\) as input and gives
m outputs. Each output
\(\phi _j({\mathbf {x}})\) corresponds to a discrete time-point
\(\tau _j\), so
\(\phi ({\mathbf {x}}) = {\{\phi _1({\mathbf {x}}), \ldots , \phi _m({\mathbf {x}})\}}\). As the discrete hazards are (conditional) probabilities, we apply the logistic function (sigmoid function) to the output of the network
$$\begin{aligned} h(\tau _j \,|\,{\mathbf {x}}) = \frac{1}{1 + \exp [-\phi _j({\mathbf {x}})]}, \end{aligned}$$
to ensure that
\(h(\tau _j \,|\,{\mathbf {x}}) \in (0, 1)\). We can estimate the hazard rate by minimizing the loss (7), and survival estimates can be obtained from (4). To the best of our knowledge, this method was first proposed by Gensheimer and Narasimhan (2019). However, if one considers the special case where
\(\phi _j({\mathbf {x}}) = \varvec{\beta }^T {\mathbf {x}}\), the approach is well known in the survival literature and seems to have been first addressed by Cox (1972) and Brown (1975); see also Allison (1982). The book by Tutz and Schmid (2016) gives a review of the approach.
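To illustrate how the pieces fit together, here is a minimal sketch of the Logistic-Hazard parameterization (our own, not the implementation used in the experiments; see Appendix B for the numerically more stable version). The layer sizes are arbitrary illustration choices:

```python
import torch
import torch.nn as nn

class LogisticHazard(nn.Module):
    """MLP phi(x) with m outputs, one logit per discrete time tau_1, ..., tau_m."""

    def __init__(self, q, m, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(q, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, m),
        )

    def forward(self, x):
        return self.net(x)  # logits phi_j(x); sigmoid gives h(tau_j | x)

    @torch.no_grad()
    def predict_surv(self, x):
        h = torch.sigmoid(self.net(x))        # h(tau_j | x) in (0, 1)
        return torch.cumprod(1.0 - h, dim=1)  # S(tau_j | x) by (4)
```

Training then amounts to minimizing (7) on the sigmoid of the network output, e.g., with the `hazard_nll` sketch above.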
The implementation we use in the experiments in Sects.
4 and
5 differs slightly from that of Gensheimer and Narasimhan (
2019), as it was found to be numerically more stable (see Appendix B). In this paper, we will refer to the method as
Logistic-Hazard, as coined by Brown (
1975), but one can also find the term Logistic Discrete Hazard used in the statistical literature. Gensheimer and Narasimhan (
2019) referred to it as
Nnet-survival, but we will refrain from using that name as we find Logistic-Hazard to be more descriptive.
We can obtain a survival model by parameterizing the PMF in a manner similar to the Logistic-Hazard method. Like the hazards, the PMF
\(f(\tau _j \,|\,{\mathbf {x}})\) represents probabilities but, contrary to the conditional probabilities that define the hazard, we now require the PMF to sum to 1. As we only observe event times in
\({\mathbb {T}}_C\), we fulfill this requirement indirectly through the probability of surviving past
\(\tau _m\). Thus we have
$$\begin{aligned} \sum _{k=1}^m f(\tau _k \,|\,{\mathbf {x}}) + S(\tau _m \,|\,{\mathbf {x}}) = 1. \end{aligned}$$
(8)
Now, again with
\(\phi ({\mathbf {x}}) \in {\mathbb {R}}^{m}\) denoting a neural network, the PMF can be expressed as
$$\begin{aligned} f(\tau _j \,|\,{\mathbf {x}}) = \frac{\exp [\phi _j({\mathbf {x}})]}{1 + \sum _{k=1}^{m} \exp [\phi _k({\mathbf {x}})]}, \quad \quad \text {for } j = 1, \ldots , m. \end{aligned}$$
(9)
Note that (9) is equivalent to the softmax function (also used in multinomial logistic regression) with a fixed
\(\phi _{m+1}({\mathbf {x}}) = 0\). Alternatively, one could let
\(\phi _{m+1}({\mathbf {x}})\) vary freely, something that is quite common in machine learning, but we choose to follow the typical conventions in statistics. By combining (1) and (8), we can express the survival function as
$$\begin{aligned} S(\tau _j \,|\,{\mathbf {x}}) = \sum _{k=j+1}^{m} f(\tau _k \,|\,{\mathbf {x}}) + S(\tau _m \,|\,{\mathbf {x}}) \end{aligned}$$
(10)
for
\(j=1,\ldots ,m-1\), and
$$\begin{aligned} S(\tau _m \,|\,{\mathbf {x}}) = \frac{1}{1 + \sum _{k=1}^m \exp [\phi _k({\mathbf {x}})]}. \end{aligned}$$
Now, let
\(\sigma _j[\phi ({\mathbf {x}})]\), for
\(j=1,\ldots ,m+1\), denote the softmax in (9), meaning
\(\sigma _{m+1}[\phi ({\mathbf {x}})] = S(\tau _m \,|\,{\mathbf {x}})\). Notice the similarities to classification with
\(m+1\) classes, as we are essentially classifying whether the event happens at one of the times
\(\tau _1, \ldots , \tau _m\) or later than
\(\tau _m\). However, due to censoring, the likelihood is not the cross-entropy. Instead, by inserting (9) and (10) into (6), we get the mean negative log-likelihood
$$\begin{aligned} \text {loss}&= -\frac{1}{n} \sum _{i=1}^n \left( d_i \log [\sigma _{\kappa (t_i)}(\phi ({\mathbf {x}}_i)) ] + (1-d_i) \log \left[ \sum _{k={\kappa (t_i)}+1}^{m+1} \sigma _k(\phi ({\mathbf {x}}_i)) \right] \right) , \end{aligned}$$
(11)
where
\({\kappa (t_i)}\) still denotes the index of individual
i’s event or censoring time, that is,
\(t_i = \tau _{\kappa (t_i)}\). This is essentially the same negative log-likelihood as presented by Lee et al. (2018). Note, however, that contrary to the work by Lee et al. (2018), the negative log-likelihood in (11) allows for survival past time
\(\tau _m\). Some numerical improvements of the implementation are addressed in Appendix B. We will refer to this method simply as PMF, as this term is unambiguously discrete, contrary to the term hazard, which is used for both discrete and continuous time.
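A direct transcription of (11), again without the numerical improvements of Appendix B, could look as follows (names and layout are ours); the fixed \(\phi _{m+1}({\mathbf {x}}) = 0\) is appended as an extra column before the softmax:

```python
import torch
import torch.nn.functional as F

def pmf_nll(phi, idx, events):
    """Mean negative log-likelihood (11) for the PMF parameterization.

    phi:    (n, m) network outputs; a fixed phi_{m+1}(x) = 0 is appended,
            so that the softmax over the m+1 entries gives (9) and S(tau_m | x).
    idx:    (n,) long tensor with the 0-based index of t_i, i.e. kappa(t_i) - 1.
    events: (n,) float tensor with the event indicators d_i.
    """
    n, m = phi.shape
    phi_pad = torch.cat([phi, torch.zeros(n, 1)], dim=1)  # phi_{m+1} = 0
    sigma = F.softmax(phi_pad, dim=1)                     # sigma_1, ..., sigma_{m+1}
    # Event term: log sigma_{kappa(t_i)}, i.e. log f(t_i | x_i) by (9).
    log_f = torch.log(sigma[torch.arange(n), idx])
    # Censoring term: log sum_{k > kappa(t_i)} sigma_k = log S(t_i | x_i) by (10).
    col = torch.arange(m + 1).unsqueeze(0)
    surv_t = (sigma * (col > idx.unsqueeze(1)).float()).sum(dim=1)
    log_s = torch.log(surv_t)
    return -(events * log_f + (1.0 - events) * log_s).mean()
```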
As a side note, the Multi-task logistic regression (Yu et al. 2011), and the neural network extension of this method (Fotso 2018), can be shown to be a PMF model by considering a cumulative sum of the linear predictor or, in the neural network case, of the output of the network. Details are given in Appendix C.
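As a rough illustration of the kind of correspondence meant here (the precise derivation is in Appendix C; the direction of the cumulative sum below, from index j to m, is our assumption):

```python
import torch

# Hypothetical MTLR-style network output for one individual (m = 4 time points).
phi_mtlr = torch.tensor([[0.3, -0.1, 0.5, 0.2]])

# Reversed cumulative sum phi_j = sum_{k=j}^{m} phi_k^{MTLR}; under this
# (assumed) correspondence, the result plugs into the PMF softmax (9).
phi_pmf = phi_mtlr.flip(1).cumsum(1).flip(1)
```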