The question arises as to how frequently new predictions should be obtained for an instance of interest awaiting its true label. Intuitively, we expect the unlabelled instance to be used as an input for a prediction model more frequently when: (a) the period between getting instance data for the first time and receiving the delayed label is long, or (b) multiple new labelled instances \(\mathcal {S}_m\) arrive before the ground truth label \( y _k\) becomes available. This is because numerous \(\mathcal {S}_m\) instances may cause the model h to evolve during this period. Hence, let us propose that continuous re-evaluation be parameterised with the number K of labelled instances \(\{(\mathbf {x}_i, y _i)\}\) that have to appear in the stream between two consecutive predictions \(\hat{y}_k^{t_j},\hat{y}_k^{t_{j+1}}\) made for the instance \(\{(\mathbf {x}_k,?)\}\). In other words, a new prediction for \(\{(\mathbf {x}_k,?)\}\) will be made after every K new labelled instances until the true label \( y _k\) arrives.
The proposed continuous re-evaluation algorithm is summarised in Algorithm 1. First, the algorithm checks whether an unlabelled instance or a true label has arrived. In the former case of an instance awaiting its true label, the instance is placed in a buffer L, and the first prediction is made for it and placed in bin \(b=0\). When a true label arrives, the prediction immediately preceding the model update is placed in bin \(b=B+1\). Next, performance measures are updated based on all the buffered predictions made for the instance. Furthermore, for every other instance \(m \in L\) still awaiting its true label, the number of labelled instances received since t(m) is checked. Every K labelled instances, a new prediction is made with the most recent model. In this way, possible updates to the model can be reflected in refined predictions, which are immediately published and can be used in scenarios such as on-line monitoring of industrial processes. Such periodic predictions are all placed in a placeholder bin \((b=-1)\), to be shuffled into the proper bins once the true label latency and bin length are known. Last but not least, the model \(h_i\) is updated.
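To make these steps concrete, the following minimal Python sketch reproduces the loop of Algorithm 1. The stream event format, the buffer layout, and the model's predict()/learn() interface are illustrative assumptions rather than the paper's exact pseudocode.

```python
# A sketch of the continuous re-evaluation loop (Algorithm 1); event fields,
# buffer layout and the model interface are assumed for illustration.

def continuous_reevaluation(stream, model, K, B, update_performance_measures):
    buffer = {}        # k -> instance data and its prediction tuples P(k)
    labelled_seen = 0  # number of labelled instances received so far

    for event in stream:
        if event["kind"] == "unlabelled":            # (x_k, ?) arrives
            y_hat = model.predict(event["x"])
            buffer[event["k"]] = {
                "x": event["x"], "n0": labelled_seen,
                "preds": [(y_hat, event["t"], 0)],   # first prediction, bin b = 0
            }
        else:                                        # true label y_k arrives
            labelled_seen += 1
            entry = buffer.pop(event["k"])
            y_hat = model.predict(entry["x"])        # prediction preceding model update
            entry["preds"].append((y_hat, event["t"], B + 1))
            # placeholder bins inside entry["preds"] are resolved here, since
            # the label latency is now known
            update_performance_measures(entry["preds"], event["y"], event["t"])

            # periodic re-evaluation: every K labelled instances, refresh the
            # prediction of each instance still awaiting its true label
            for e in buffer.values():
                if (labelled_seen - e["n0"]) % K == 0:
                    e["preds"].append((model.predict(e["x"]), event["t"], -1))
            model.learn(entry["x"], event["y"])      # the model update comes last
```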
The way performance measures are updated is particularly important. This is achieved by the updatePerformanceMeasures method, which takes as input all prediction tuples P(k) obtained for the instance of interest, the true label \( y _k\), and the time this true label arrived, t(\(\mathcal {S}_i\)). The way performance measures can be calculated based on prediction tuples is proposed below.
Let us propose binned performance, defined as the average performance of the models available in the bth bin of the period between attaining instance data and receiving its true label, where \(b=0,\ldots ,B+1\) is the bin index. We propose that every labelled instance contributes to binned performance with the same weight. The primary objective of binned performance is to enable visualisation of performance evolution under evolving predictions. The performance can be defined through various indicators, including, but not limited to, accuracy and the kappa statistic. The first bin, denoted by \(b=0\), represents the model performance observed at the time the instance data becomes available for the first time. The last bin, denoted by \(b=B+1\), represents test-then-train performance. All the remaining bins are used to aggregate performance indicators of evolving predictions for individual subperiods of the period between \(t(\mathcal {S}_i)\) and \(\tilde{t}(\mathcal {S}_i)\). The way such aggregation is performed for individual instances and the entire data stream is proposed below.
Once the true label becomes available, individual predictions can be linked to the appropriate bins. Let us observe that for \(1 \le b\le B\), the sequence of predictions \(U(\mathcal {S}_i,b)\) available for instance \(\mathcal {S}_i=\{(\mathbf {x}_k,?)\}\) and period b includes all predictions for the period of this bin. More precisely, this means the prediction available at the beginning of the bin period, the predictions made during this period, and the prediction available at the end of it. A formal definition of \(U(\mathcal {S}_i,b)\) is provided in “Appendix 1”. In the case of flight data, these predictions would contain the possibly changing predicted flight status displayed at the beginning of and during the bin period. Hence, the set of predictions made for \(\mathcal {S}_i\) and linked to bin b is as follows:
$$\begin{aligned} R(\mathcal {S}_i,b)= {\left\{ \begin{array}{ll} \{(y_i,t_i)\in P(k): b_i=0\} &{} \quad b=0\\ \{(y_i,t_i)\in U(\mathcal {S}_i,b)\} &{} \quad b\in \{1,\ldots ,B\} \\ \{(y_i,t_i)\in P(k): b_i=B+1\} &{} \quad b=B+1 \end{array}\right. } \end{aligned}$$
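Periodic predictions enter Algorithm 1 in the placeholder bin \(b=-1\) and are shuffled into the bins of this mapping once the label latency is known. A plausible sketch of that assignment is given below; it assumes B equal-length subperiods, whereas the authoritative formal definition is the one given in “Appendix 1”.

```python
# Sketch: map a periodic prediction made at time t_pred from the placeholder
# bin into its proper bin b in {1, ..., B}. Equal-length subperiods of the
# latency period [t_arrival, t_label] are an assumption of this sketch.

def assign_bin(t_pred, t_arrival, t_label, B):
    latency = t_label - t_arrival
    if latency <= 0:
        return 1                                   # degenerate case: no waiting period
    fraction = (t_pred - t_arrival) / latency      # elapsed share of the latency
    return min(max(int(fraction * B) + 1, 1), B)   # clamp to {1, ..., B}
```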
It follows from Algorithm 1 that \(R(\mathcal {S}_i,b)\) contains exactly one element for \(b\in \{0,B+1\}\), i.e. the first time prediction and the test-then-train prediction, respectively. In the case of periodic predictions, i.e. \(b\in \{1,\ldots ,B\}\), \(card(R(\mathcal {S}_i ,b))> 1\). The question arises of how to calculate performance indicators for individual bins when multiple, possibly different predictions may exist for one bin period and one instance. We propose to handle such possibly diverse predictions in the same way the output of a probabilistic classifier, taking the form of a vector of probabilities, is mapped to a single predicted class. Hence, we propose to determine a single averaged prediction \(d(U(\mathcal {S}_i,b))\). The way the averaged prediction is selected is defined by the d() function, which is a parameter of Algorithm 1. In the case of classification, we propose to use a function selecting the dominating prediction, i.e. the prediction displayed during the largest part of the bin period, as the averaged prediction. Such a prediction is defined as follows: \(d_\mathrm {D}(U(\mathcal {S}_i,b))=\arg \max _c \sum _{j=2}^{card(U(\mathcal {S}_i,b))}(t_j-t_{j-1}): y_{j-1}=c\). In other words, we propose to determine the prediction that was announced for the longest total period of time during the bin subperiod, by analogy to the class with the highest probability. In the airline use case, this would be the predicted delay status that was displayed for the longest total period of time during the bin period. It is worth noting that, in this way, multiple identical predictions contained in \(R(\mathcal {S}_i,b)\) will frequently be replaced with a single one.
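Under the assumption that \(U(\mathcal {S}_i,b)\) is available as a time-ordered list of (prediction, timestamp) tuples, the dominating prediction can be sketched as follows; variable names are illustrative.

```python
# Sketch of d_D: pick the class displayed for the longest total time during
# the bin period. U is a time-ordered list of (y_hat, t) tuples with at least
# two elements (the predictions at the beginning and at the end of the bin).

from collections import defaultdict

def dominating_prediction(U):
    display_time = defaultdict(float)
    for (y_prev, t_prev), (_, t_next) in zip(U, U[1:]):
        display_time[y_prev] += t_next - t_prev    # time y_prev stayed displayed
    return max(display_time, key=display_time.get)
```

Note that the last tuple only delimits the final display interval; in line with the \(j-1\) indexing of the formula above, its predicted label receives no weight of its own.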
Hence, for every instance and bin combination, one predicted label \(\hat{y}_k\) is used to calculate performance measures. For every bin \(b\in \{0,\ldots ,B+1\}\), \(V(\mathcal {S}_i,b)\) denotes the pair \(\{( y_k ,\hat{y}_k)\}\) of true label and predicted label used to calculate performance measures. In the case of bins \(b\in \{0,B+1\}\), the predicted label \(\hat{y}_k\) is the first time and the test-then-train prediction for instance \(\mathcal {S}_i\), respectively. In the case of the remaining bins, i.e. the bins representing periodic predictions, the predicted label \(\hat{y}_k\) used to calculate performance measures is the averaged prediction for instance \(\mathcal {S}_i\) in the bth subperiod. A formal definition of \(V(\mathcal {S}_i,b)\) is provided in “Appendix 1”.
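A compact sketch of this selection, reusing the dominating_prediction function above as the averaging function d, could look as follows; the formal definition in “Appendix 1” remains authoritative.

```python
# Sketch of V(S_i, b): the single (true label, predicted label) pair used to
# update performance measures for bin b. R_b holds the predictions linked to
# bin b; d is the averaging function, e.g. dominating_prediction above.

def v_pair(y_true, R_b, b, B, d):
    if b == 0 or b == B + 1:
        (y_hat, _t), = R_b       # these bins contain exactly one prediction
        return (y_true, y_hat)
    return (y_true, d(R_b))      # averaged prediction for periodic bins 1..B
```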
Furthermore, let us also propose an aggregate loss function \(\lambda \) over a single bin b, applied to the sequence of tuples \(V(\mathcal {S}_i,b), \ldots , V(\mathcal {S}_j,b),\ldots \) gradually produced for individual labelled instances. In particular, the tuples are ordered by true label arrival, i.e. \(\tilde{t}(\mathcal {S}_i)\le \tilde{t}(\mathcal {S}_j)\) for \(i \le j\). Moreover, every labelled instance contributes to the performance evaluation of every bin. Since performance evaluation is not affected by the varied number of predictions made in the bth subperiod for individual instances, performance evolution is due to changes in the models, not due to varied sets of instances and their predictions being used to calculate performance for different bins. Furthermore, for \(B=0\), the formula resolves to a performance assessment taking into account only the predictions made at the time of receiving unlabelled instance data for the first time (\(b=0\)) and at the time immediately preceding the arrival of the true label (\(b=B+1\)).
Finally, let us propose the aggregate loss function to be defined as a function of the bin index: \(\varLambda (T,b)=\lambda \big (V(\mathcal {S}_i,b), \ldots , V(\mathcal {S}_j,b),\ldots : \mathcal {S}_i,\mathcal {S}_j \in \varOmega (T)\big )\), where \(\varOmega (T)\) denotes the sequence of all labelled instances \(\{(\mathbf {x}_k, y _k)\}\) that arrived by time T, i.e. \(t(\{(\mathbf {x}_k, y _k)\})\le T\), and \(b\in \{0,\ldots ,B+1\}\). The function provides a numerical assessment of the performance of evolving classification models as a function of the discretised proportion of time that elapsed between getting instance data and receiving the true labels.
It is important to note that every \(V(\mathcal {S}_i,b)\) takes the form of one pair \(\{( y_k ,\hat{y}_k)\}\). Hence, for every bin index, the performance assessment performed by \(\varLambda (T,b)\) is based on the same sequence of true labels \( y_k \), matched with predicted labels \(\hat{y}_k\) that depend on the value of b. The predicted labels \(\hat{y}_k\) are: first time predictions (for \(b=0\)); averaged predictions (for \(b\in \{1,\ldots ,B\}\), i.e. for individual subperiods between the first time prediction and true label arrival); and test-then-train predictions (for \(b=B+1\)). Hence, this approach is uniform for all three cases. In particular, \(\varLambda (T,0)\) provides the aggregate loss value for first time predictions, \(\varLambda (T,B+1)\) yields the performance assessment calculated for test-then-train mode, and \(\varLambda (T,b), b\in \{1,\ldots ,B\}\) provides the aggregate loss value for the predictions available in individual subperiods of the periods preceding true label arrivals. Hence, the aggregate loss function \(\varLambda (T,b)\) can be considered a generalisation of previously existing performance indicators used for the delayed label setting.
To calculate \(\varLambda (T,b)\), different loss functions can be applied. Without loss of generality, a zero-one loss function, where the cost of misclassification \(\xi ( y _a,\hat{y}_a)=0\) if \(\hat{y}_a= y _a\) and 1 otherwise, will be used throughout the rest of this study. In this case, \(\varLambda (T,b)=\frac{\sum _{\mathcal {S}_i \in \varOmega (T)} \xi ( y_k ,\hat{y}_k):( y_k ,\hat{y}_k)=V(\mathcal {S}_i,b)}{card(\varOmega (T))}\).
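In this zero-one case, \(\varLambda (T,b)\) is simply the misclassification rate over the bin-b pairs, as in the short sketch below; the pairs argument is assumed to hold \(V(\mathcal {S}_i,b)\) for every \(\mathcal {S}_i \in \varOmega (T)\).

```python
# Sketch: aggregate zero-one loss Lambda(T, b) as the fraction of instances
# whose bin-b predicted label differs from the true label.

def aggregate_zero_one_loss(pairs):
    return sum(1 for y_true, y_hat in pairs if y_hat != y_true) / len(pairs)
```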