We propose the NIPA method to predict the COVID-19 outbreak. The method consists of three steps. First, we preprocess the raw data of the confirmed number of infected individuals to obtain an SIR time series \(v_{i}[1],..., v_{i}[n]\) of the viral state for every city \(i\). Here, the number of observations is denoted by \(n\). Second, based on the time series \(v_{i}[1],..., v_{i}[n]\), we obtain estimates \(\hat {\delta }_{i}\) and \(\hat {\beta }_{ij}\) of the unknown spreading parameters \(\delta_{i}\) and \(\beta_{ij}\). Third, the estimates \(\hat {\delta }_{i}\) and \(\hat {\beta }_{ij}\) result in an SIR model (3), which we iterate for future times \(k\) to predict the evolution of the COVID-19 epidemic. In the following, we give an outline of the first two steps of the prediction method. We refer the reader to Supplementary Information S1 for further details on NIPA.
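The third step, iterating the fitted model forward in time, can be illustrated with a minimal sketch. The update rule below is an assumption: it is the discrete-time networked SIR form implied by Eqs. (4) to (6), namely \(\mathcal {I}_{i}[k+1] = (1-\delta_{i})\mathcal {I}_{i}[k] + \mathcal {S}_{i}[k]\sum_{j}\beta_{ij}\mathcal {I}_{j}[k]\) and \(\mathcal {R}_{i}[k+1] = \mathcal {R}_{i}[k] + \delta_{i}\mathcal {I}_{i}[k]\). The function names and all numbers are hypothetical, and plain Python is used only to keep the sketch dependency-free.

```python
def sir_step(S, I, R, beta, delta):
    """Advance the networked SIR state of all N cities by one time step.

    Assumed update (inferred from Eqs. (4)-(6), not quoted from the paper):
      I_i[k+1] = (1 - delta_i) * I_i[k] + S_i[k] * sum_j beta[i][j] * I_j[k]
      R_i[k+1] = R_i[k] + delta_i * I_i[k]
      S_i[k+1] = 1 - I_i[k+1] - R_i[k+1]
    """
    n_cities = len(I)
    S_next, I_next, R_next = [], [], []
    for i in range(n_cities):
        new_infections = S[i] * sum(beta[i][j] * I[j] for j in range(n_cities))
        I_new = (1.0 - delta[i]) * I[i] + new_infections
        R_new = R[i] + delta[i] * I[i]
        I_next.append(I_new)
        R_next.append(R_new)
        S_next.append(1.0 - I_new - R_new)
    return S_next, I_next, R_next


def predict(S, I, R, beta, delta, horizon):
    """Iterate the model `horizon` steps ahead and return the I trajectory."""
    trajectory = []
    for _ in range(horizon):
        S, I, R = sir_step(S, I, R, beta, delta)
        trajectory.append(list(I))
    return trajectory


# Toy example with two hypothetical cities (all numbers are made up).
beta_hat = [[0.5, 0.1], [0.0, 0.4]]
delta_hat = [0.2, 0.3]
I0, R0 = [0.01, 0.0], [0.0, 0.0]
S0 = [1.0 - i - r for i, r in zip(I0, R0)]
forecast = predict(S0, I0, R0, beta_hat, delta_hat, horizon=3)
```

In this sketch, `forecast[k]` holds the predicted infectious fractions of all cities `k + 1` steps after the last observation.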
Data preprocessing
We denote the number of observations by \(n\), which equals the number of days since January 21, 2020. Based on the reported number of infections \(N_{{rep},i}[k]\), our goal is to obtain an SIR viral state vector \(v_{i}[k]= (\mathcal {S}_{i}[k], \mathcal {I}_{i}[k], \mathcal {R}_{i}[k])^{T}\) for every city \(i\) at any time \(k=1,...,n\). The fraction of susceptible individuals follows as \(\mathcal {S}_{i}[k] = 1 - \mathcal {I}_{i}[k] - \mathcal {R}_{i}[k]\) at any time \(k\ge 1\). Thus, it suffices to determine the fraction of infectious individuals \(\mathcal {I}_{i}[k]\) and recovered individuals \(\mathcal {R}_{i}[k]\).
The fraction of infectious individuals \(\mathcal {I}_{i}[k]\) follows from the reported fraction of infections \(\mathcal {I}_{{rep}, i}[k]\). To be precise, the reported data is the number \(N_{{rep},i}[k]\) of individuals who are detected to be infected by COVID-19. Upon detection of the infection, the respective individuals are hospitalised and, hence, no longer infectious to individuals outside of the hospital. We consider the reported fraction of infections \(\mathcal {I}_{{rep}, i}[k]\) as an approximation for the fraction of infectious individuals \(\mathcal {I}_{i}[k]\). In fact, the reported fraction of infections \(\mathcal {I}_{{rep}, i}[k]\) lower-bounds the true fraction of infected individuals \(\mathcal {I}_{i}[k]\) for two reasons. First, not all infectious individuals are aware that they are infected. Second, the diagnostic capacity of the hospitals is limited, particularly when the number of infections increases rapidly. Hence, not all infectious individuals who arrive at a hospital can be reported in a timely manner.
We do not know the fraction of removed individuals \(\mathcal {R}_{i}[k]\). At the initial time \(k=1\), it is realistic to assume that \(\mathcal {R}_{i}[1]=0\) holds for every city \(i\). At any time \(k\ge 2\), the removed individuals \(\mathcal {R}_{i}[k]\) could be obtained from (3), if the curing probability \(\delta_{i}\) were known. However, we do not know the curing probability \(\delta_{i}\). Hence, we consider 50 equidistant candidate values for the curing probability \(\delta_{i}\), ranging from \(\delta_{\min}=0.01\) to \(\delta_{\max}=1\). We define the set of candidate values as \(\Omega=\{\delta_{\min},...,\delta_{\max}\}\). For every candidate value \(\delta_{i}\in \Omega\), the fraction of removed individuals \(\mathcal {R}_{i}[k]\) follows from (3) at all times \(k\ge 2\). Thus, we obtain 50 potential sequences \(\mathcal {R}_{i}[1],...,\mathcal {R}_{i}[n]\), each corresponding to one candidate value \(\delta_{i}\in \Omega\). We estimate the curing probability \(\delta_{i}\), and hence implicitly the sequence \(\mathcal {R}_{i}[1],...,\mathcal {R}_{i}[n]\), as the element in \(\Omega\) that results in the best fit of the SIR model (3) to the reported number of infections.
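The construction of the candidate set \(\Omega\) and of the 50 candidate sequences \(\mathcal {R}_{i}[1],...,\mathcal {R}_{i}[n]\) can be sketched as follows. The recursion \(\mathcal {R}_{i}[k+1] = \mathcal {R}_{i}[k] + \delta_{i}\mathcal {I}_{i}[k]\) is an assumed form of the removal equation in (3), and the function names and input values are hypothetical. Selecting the best-fitting candidate requires the network inference step and is not shown.

```python
def candidate_curing_probabilities(delta_min=0.01, delta_max=1.0, count=50):
    """Return `count` equidistant values from delta_min to delta_max inclusive."""
    step = (delta_max - delta_min) / (count - 1)
    return [delta_min + m * step for m in range(count)]


def removed_sequence(infectious, delta):
    """Build R[1..n] from I[1..n], assuming R[1] = 0 and
    R[k+1] = R[k] + delta * I[k] (assumed form of the removal equation)."""
    removed = [0.0]
    for fraction_infectious in infectious[:-1]:
        removed.append(removed[-1] + delta * fraction_infectious)
    return removed


omega = candidate_curing_probabilities()
infectious = [0.001, 0.002, 0.004, 0.007]   # made-up fractions for one city
# One candidate sequence of removed fractions per candidate curing probability.
sequences = {delta: removed_sequence(infectious, delta) for delta in omega}
```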
The raw time series \(\mathcal {I}_{{rep},i}[1],..., \mathcal {I}_{{rep},i}[n]\) exhibits erratic fluctuations. There is a single outlier in city \(i=1\) (Wuhan) at time \(k=8\) (January 28, 2020), which we replace by \(\mathcal {I}_{{rep},1}[8]= (\mathcal {I}_{{rep},1}[7]+\mathcal {I}_{{rep},1}[9])/2\). (Potentially, the outlier is due to the increase in the maximum number of individuals that can be diagnosed in Wuhan, from 200 to 2000 individuals per day as of January 27th (https://m.chinanews.com/wap/detail/zw/sh/2020/01-28/9071697.shtml, unpublished).) To reduce the fluctuations, we apply a moving average, provided by the Matlab command smoothdata, to the time series \(\mathcal {I}_{{rep},i}[1],..., \mathcal {I}_{{rep},i}[n]\) of every city \(i\). The preprocessed time series \(\mathcal {I}_{i}[1],..., \mathcal {I}_{i}[n]\) equals the output of smoothdata.
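The two preprocessing operations, replacing an outlier by the mean of its neighbours and smoothing with a moving average, can be sketched in a few lines. This is not a reimplementation of the Matlab command: the window length of 3, the requirement that it be odd, and the shrinking of the window at the ends of the series are assumptions made for illustration, since the text does not state the settings used. The function names and data are hypothetical.

```python
def replace_outlier(series, index):
    """Replace series[index] by the mean of its two neighbours (0-based index)."""
    repaired = list(series)
    repaired[index] = (series[index - 1] + series[index + 1]) / 2.0
    return repaired


def moving_average(series, window=3):
    """Centred moving average for an odd window; the window shrinks at both ends."""
    half = window // 2
    smoothed = []
    for k in range(len(series)):
        lo = max(0, k - half)
        hi = min(len(series), k + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed


reported = [1.0, 2.0, 9.0, 4.0, 5.0]           # made-up values with a spike
repaired = replace_outlier(reported, index=2)  # spike becomes (2.0 + 4.0) / 2
smoothed = moving_average(repaired, window=3)
```

Note that the paper's time index \(k=8\) corresponds to `index=7` in this 0-based convention.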
Network inference
For every city \(i\), the curing probability \(\delta_{i}\) is estimated as one of the candidate values in the set \(\Omega\), as outlined above. The remaining task is to estimate the infection probabilities \(\beta_{ij}\). The goal of network inference (Peixoto 2019; Ma et al. 2019; Di Lauro et al. 2019; Timme and Casadiego 2014; Wang et al. 2016) is to estimate the matrix \(B\) of infection probabilities from the SIR viral state observations \(v_{i}[1],..., v_{i}[n]\). The matrix \(B\) can be interpreted as a weighted adjacency matrix. We adapt a network inference approach (Prasse and Van Mieghem 2018; 2020), which is based on formulating a set of linear equations and the least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996; Hastie et al. 2015). We remark that the network inference approach (Prasse and Van Mieghem 2020) is also applicable to general compartmental epidemic models (Sahneh et al. 2013), such as the Susceptible-Exposed-Infected-Removed (SEIR) epidemic model. The crucial observation from the SIR governing equations (3) is that the infection probabilities \(\beta_{ij}\) appear linearly, whereas the state variables \(\mathcal {S}_{i}, \mathcal {I}_{i}\) and \(\mathcal {R}_{i}\) do not. From (3), the infection probabilities \(\beta_{ij}\) satisfy
$$\begin{array}{*{20}l} V_{i} = F_{i} \left(\begin{array}{ccc} \beta_{i1} \\ \vdots \\ \beta_{iN} \end{array}\right) \end{array} $$
(4)
for all cities \(i=1,..., N\). Here, the \((n-1)\times 1\) vector \(V_{i}\) and the \((n-1)\times N\) matrix \(F_{i}\) are given by
$$\begin{array}{*{20}l} V_{i} = \left(\begin{array}{ccc} \mathcal{I}_{i}[2] - (1 - \delta_{i})\mathcal{I}_{i}[1]\\ \vdots \\ \mathcal{I}_{i}[n] - (1 - \delta_{i})\mathcal{I}_{i}[n-1] \end{array}\right) \end{array} $$
(5)
and
$$\begin{array}{*{20}l} F_{i} = \left(\begin{array}{ccc} \mathcal{S}_{i}[1] \mathcal{I}_{1}[1]&... & \mathcal{S}_{i}[1] \mathcal{I}_{N}[1] \\ \vdots & \ddots & \vdots\\ \mathcal{S}_{i}[n-1] \mathcal{I}_{1}[n-1]&... & \mathcal{S}_{i}[n-1] \mathcal{I}_{N}[n-1] \end{array}\right). \end{array} $$
(6)
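Assembling \(V_{i}\) and \(F_{i}\) from the preprocessed time series is a direct transcription of Eqs. (5) and (6): row \(k\) of the system states \(\mathcal {I}_{i}[k+1] - (1-\delta_{i})\mathcal {I}_{i}[k] = \mathcal {S}_{i}[k]\sum_{j}\beta_{ij}\mathcal {I}_{j}[k]\). The following minimal sketch uses a hypothetical function name, nested lists instead of a matrix library, and made-up data.

```python
def build_linear_system(i, S, I, delta_i):
    """Return (V_i, F_i) of Eqs. (5) and (6) for city i.

    S and I are lists over cities; S[j][k] and I[j][k] hold the fractions of
    city j at observation k (0-based, so k = 0 is the paper's k = 1).
    """
    n_obs = len(I[i])
    n_cities = len(I)
    # Eq. (5): one entry per pair of consecutive observations.
    V = [I[i][k + 1] - (1.0 - delta_i) * I[i][k] for k in range(n_obs - 1)]
    # Eq. (6): entry (k, j) is S_i[k] * I_j[k].
    F = [[S[i][k] * I[j][k] for j in range(n_cities)] for k in range(n_obs - 1)]
    return V, F


# Two hypothetical cities with n = 3 observations each (made-up numbers).
I = [[0.10, 0.20, 0.30], [0.00, 0.10, 0.10]]
S = [[0.90, 0.75, 0.60], [1.00, 0.90, 0.85]]
V_1, F_1 = build_linear_system(0, S, I, delta_i=0.5)
```

With \(n=3\) observations and \(N=2\) cities, `V_1` has \(n-1=2\) entries and `F_1` has \(n-1=2\) rows and \(N=2\) columns, matching the dimensions stated above.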
If the SIR model (3) were an exact description of the evolution of the coronavirus, then the linear system (4) would hold with equality. However, the viral state vector \(v_{i}[k]\) in city \(i\) does not exactly follow the SIR model (3). Instead, the evolution of the viral state vector \(v_{i}[k]\) is described by
$$\begin{array}{*{20}l} v_{i} [k + 1] & = f_{\textrm{SIR}}(v_{1}[k],..., v_{N}[k]) + w_{i}[k], \end{array} $$
where the \(3\times 1\) vector \(f_{\textrm{SIR}}(v_{1}[k],..., v_{N}[k])\) denotes the right-hand sides of the SIR model (3), and the \(3\times 1\) vector \(w_{i}[k]\) denotes the unknown model error of city \(i\) at time \(k\). Due to the model errors \(w_{i}[k]\), the linear system (4) only holds approximately. Thus, we resort to estimating the infection probabilities \(\beta_{ij}\) by minimising the deviation between the left side and the right side of (4). We infer the network by the LASSO (Tibshirani 1996; Hastie et al. 2015) as follows:
$$\begin{array}{*{20}l} \begin{aligned} & \underset{\beta_{i1},..., \beta_{iN}}{\operatorname{min}} & & \left\lVert V_{i} - F_{i} \left(\begin{array}{ccc} \beta_{i1} \\ \vdots \\ \beta_{iN} \end{array}\right) \right\rVert^{2}_{2} + \rho_{i} \sum\limits^{N}_{j=1, j\neq i}\beta_{ij} & \\ &{s.t.} & & 0\le \beta_{ij} \le 1, \quad j=1,..., N. &\end{aligned} \end{array} $$
(7)
The first term in the objective function of (7) measures the deviation between the left side and the right side of (4). The sum in the objective of (7) is an \(\ell_{1}\)-norm regularisation term which avoids overfitting. We choose not to penalise the probabilities \(\beta_{ii}\), since we expect the infections among individuals within the same city \(i\) to be dominant. The regularisation parameter \(\rho_{i}>0\) is set by cross-validation. The LASSO network inference (7) allows for the incorporation of a priori knowledge of the contact network \(B\) by adding further constraints to the infection probabilities \(\beta_{ij}\). We emphasise that an accurate prediction of an SIR epidemic outbreak does not require an accurate network inference (Prasse and Van Mieghem 2020), see also Supplementary Information S1. If the observed viral state sequence \(v_{i}[1],..., v_{i}[n]\) is generated by the SIR model (3), then NIPA accurately predicts the infection state \(\mathcal {I}_{i}[k]\). Furthermore, NIPA provides accurate short-term predictions even when the viral state \(v_{i}[k]\) does not exactly follow the SIR model (3), i.e., in the presence of model errors \(w_{i}[k]\). We refer the reader to Supplementary Information S1 for further details on NIPA.
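For illustration, the optimisation problem (7) can be solved approximately with projected gradient descent. This solver is a stand-in chosen for brevity and is not necessarily the one used by the authors. The sketch relies on the fact that, under the constraint \(\beta_{ij}\ge 0\), the \(\ell_{1}\) penalty reduces to a linear term, so the objective is a smooth convex function over a box. The function name, the iteration count and the toy data are hypothetical; the choice of \(\rho_{i}\) by cross-validation is not shown.

```python
def lasso_box(V, F, i, rho, iterations=2000):
    """Approximately solve (7) for city i by projected gradient descent.

    Minimises ||V - F b||^2 + rho * sum_{j != i} b_j subject to 0 <= b_j <= 1.
    Because b_j >= 0, the l1 penalty reduces to a linear term.
    """
    n_rows, n_cities = len(F), len(F[0])
    frobenius_sq = sum(x * x for row in F for x in row)
    if frobenius_sq == 0.0:
        return [0.0] * n_cities
    # 2 * ||F||_F^2 upper-bounds the Lipschitz constant of the gradient,
    # so this fixed step size guarantees convergence.
    step = 1.0 / (2.0 * frobenius_sq)
    b = [0.0] * n_cities
    for _ in range(iterations):
        # Residual F b - V, evaluated once per iteration with the current b.
        residual = [sum(F[k][j] * b[j] for j in range(n_cities)) - V[k]
                    for k in range(n_rows)]
        for j in range(n_cities):
            grad = 2.0 * sum(F[k][j] * residual[k] for k in range(n_rows))
            if j != i:
                grad += rho          # beta_ii is not penalised
            # Gradient step followed by projection onto the box [0, 1].
            b[j] = min(1.0, max(0.0, b[j] - step * grad))
    return b


# Toy problem whose solution is known in closed form (made-up numbers).
F_toy = [[1.0, 0.0], [0.0, 1.0]]
V_toy = [0.5, 0.3]
beta_row = lasso_box(V_toy, F_toy, i=0, rho=0.2)   # approaches [0.5, 0.2]
```

In the toy problem, the unpenalised entry converges to the least-squares value 0.5, whereas the penalised entry is shrunk from 0.3 to 0.2. A larger \(\rho_{i}\) drives it to exactly zero, which illustrates the sparsifying effect of the regularisation term.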