Sampling method
Synthetic minority oversampling technique (SMOTE) [27] is one of the most common ways to solve the dataset imbalance problem: the minority class is upsampled and the majority class is undersampled to build a class-balanced dataset. In our setting, however, the annotation segments within a video are usually visually different from one another; even segments labeled as “Not a phase” in the same video look different from each other. Considering these factors, we propose a sampling method that balances sampling across annotation segments instead of across classes. For each annotation segment in our video dataset, we randomly draw a fixed number of training samples. Because every annotation segment contributes the same number of training samples, we name this training data sampling technique annotation segment balanced sampling (ASBS).
An example of fine-tuning I3D with ASBS is as follows: for each video, the total number of annotation segments is \(n+m\), where \(n\) segments belong to surgical phases and \(m\) (\(m \le n+1\)) segments do not belong to any surgical phase. To fine-tune I3D on our dataset, during each training epoch, five 20-second video clips are randomly selected inside each annotation segment of each video, and 64 frames are sampled from each video clip as one training sample. Each training epoch therefore yields roughly \(5v(n+m)\) training samples, where \(v\) is the total number of surgical videos in the training dataset. For data augmentation, when sampling the 64 frames of a training sample from a video clip, we take one frame every \(a\) frames, where the interval \(a\) is an integer with \(4 \le a \le 9\).
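For illustration, the following is a minimal Python sketch of how one epoch of ASBS sampling could be drawn. The segment representation, function names, and constants (mirroring the numbers above) are our own assumptions, not the authors’ implementation:

```python
import random

CLIP_SECONDS = 20      # length of each sampled video clip (seconds)
CLIPS_PER_SEGMENT = 5  # fixed number of clips drawn per annotation segment

def sample_epoch(videos):
    """Draw one epoch of ASBS training samples.

    videos: dict mapping a video id to its list of annotation segments,
    each a (start_sec, end_sec, label) triple. Returns roughly
    5 * v * (n + m) samples, matching the count given in the text.
    """
    samples = []
    for video_id, segments in videos.items():
        for start, end, label in segments:
            for _ in range(CLIPS_PER_SEGMENT):
                # Random 20-second clip inside the annotation segment.
                clip_start = random.uniform(start, max(start, end - CLIP_SECONDS))
                # Temporal stride: one frame every a frames, 4 <= a <= 9;
                # 64 frames are then read from clip_start at this stride.
                stride = random.randint(4, 9)
                samples.append((video_id, clip_start, stride, label))
    random.shuffle(samples)
    return samples
```

Because every segment contributes exactly five clips, long and short segments are represented equally, which is the point of ASBS.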
Focal loss
Because surgical phase durations vary widely and a large amount of the data is annotated as “Not a phase,” our dataset is imbalanced. This class imbalance leads the deep learning model to achieve high prediction accuracy on the majority class and poor prediction accuracy on the minority classes. Specifically, the model achieves high accuracy on the “Not a phase” class and low accuracy on the surgical phase classes, most notably those that lack training data.
In order to solve the data imbalance problem, a loss called focal loss [19] was proposed to tackle the foreground–background class imbalance problem in dense object detection. By reshaping the standard cross-entropy loss with a dynamic scaling factor, focal loss down-weights the loss assigned to easily classifiable examples, which constitute the majority of the dataset. As a result, focal loss gives less importance to easy examples and focuses training on hard ones. The focal loss function is defined as
$$\begin{aligned} {\hbox {FL}}(p_{t}) = -\alpha (1-p_{t})^\gamma \log (p_{t}) \end{aligned}$$
(1)
where \(p_{t}\) is the model’s estimated probability for the ground-truth class, \(\alpha \) is the class-balancing weight, and \(\gamma \) is the focusing parameter.
In focal loss, when a training sample is correctly classified with a high estimated probability \(p_{t}\), the modulating factor \((1-p_{t})^\gamma \) is small, and the loss for that sample is significantly down-weighted; such samples’ contribution to the total loss is greatly reduced even when they are numerous. In contrast, when a training sample is wrongly classified with a low estimated probability \(p_{t}\), its loss is up-weighted. Therefore, the deep learning model can focus on difficult examples that are incorrectly classified with low estimated probability.
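For concreteness, below is a minimal PyTorch sketch of Eq. (1) for multi-class classification. The default values of \(\alpha \) and \(\gamma \) are the common choices from the focal loss paper [19], not necessarily those used in this work:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Focal loss of Eq. (1): FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t).

    logits:  (batch, num_classes) raw model outputs.
    targets: (batch,) integer ground-truth class labels.
    """
    # log(p_t): log-probability of the ground-truth class for each sample.
    log_pt = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()  # estimated probability p_t
    # Down-weight easy examples (high p_t), up-weight hard ones (low p_t).
    return (-alpha * (1.0 - pt) ** gamma * log_pt).mean()
```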
Prior knowledge filtering
Most surgical videos contain frames where the surgeon is idle, frames with only slight motion, frames missing important visual clues, and frames with various artifacts in the middle of a surgical phase. Such frames are hard for the deep learning model to classify accurately, so the raw predictions from the model are noisy.
In order to filter the prediction noise, we investigate post-processing filtering and propose the prior knowledge filtering (PKF) algorithm. We develop the PKF algorithm in consideration of the following aspects:
(1) Phase order: Although many surgical phases do not follow a specific order, some do. For example, in the sleeve gastrectomy surgical video, the “Exploration/inspection” phase happens at the beginning of the surgery, so predictions of the “Exploration/inspection” phase at the end of a surgical video are clearly wrong and need correction. We run our model on the training dataset to locate such wrong predictions. One option for correcting them is to replace them with new surgical phase labels according to the phase order and the model’s confidence; the other is to replace them with the “Not a phase” label. We compare our corrections with the ground truth and set up prediction correction rules accordingly. In the above-mentioned example, replacing the wrong “Exploration/inspection” predictions with the “Not a phase” label corrects most of the wrong predictions on both the training and validation datasets, so we correct those wrong predictions with the “Not a phase” label.
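A minimal sketch of such an order-based correction rule follows, assuming per-second prediction labels; the cutoff fraction is an illustrative choice of ours, whereas the paper derives its rules from the training data:

```python
NOT_A_PHASE = "Not a phase"

def correct_phase_order(preds, early_phase="Exploration/inspection",
                        cutoff_fraction=0.5):
    """Relabel late predictions of an early-only phase as "Not a phase".

    preds: list of per-second predicted phase labels for one video.
    cutoff_fraction: fraction of the video after which `early_phase`
    can no longer occur (illustrative value, not from the paper).
    """
    cutoff = int(len(preds) * cutoff_fraction)
    return [NOT_A_PHASE if t >= cutoff and p == early_phase else p
            for t, p in enumerate(preds)]
```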
(2) Phase time: In order to calculate the phase time, smooth prediction results must be obtained first. A sliding window approach is used to determine the start time and the end time of each surgical phase prediction segment. We calculate the set of minimum phase times \(T = \{T_{1}, T_{2}, \ldots , T_{N}\}\) from the annotation data of the training dataset, where \(N\) is the total number of phases. For each surgical phase \(i\), we set the sliding window size by
$$\begin{aligned} W_{i} = \min (\max (W_\mathrm{min}, \eta T_{i}), W_\mathrm{max}) \end{aligned}$$
(2)
where \(W_\mathrm{min}\) is the minimum sliding window size, \(W_\mathrm{max}\) is the maximum sliding window size, and \(\eta \) is a weighting parameter. In our case, we have one prediction for each second of the video; \(W_\mathrm{min}\) is set to 10, \(W_\mathrm{max}\) to 60, and \(\eta \) to 0.2. These parameters were selected by grid search, comparing the workflow predictions with the ground truth on the validation dataset.
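As an illustrative example (not taken from the paper’s data), a phase with minimum phase time \(T_{i} = 100\) s yields \(W_{i} = \min (\max (10, 0.2 \times 100), 60) = 20\), i.e., a 20-second sliding window.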
For each surgical phase \(i\), the full video prediction results are fed piece by piece into a sliding window of length \(W_{i}\). Inside the sliding window, we count the prediction frequency for surgical phase \(i\). We set the prediction threshold value \(J_{i}\) by
$$\begin{aligned} J_{i} = \mu _{i} W_{i} \end{aligned}$$
(3)
where \(\mu _{i}\) is a weight parameter, set to 0.5 in this work. If the prediction frequency exceeds the prediction threshold, the prediction for the middle time step of the sliding window is set to phase \(i\). To further resolve the discontinuous prediction problem, adjacent prediction segments that share the same label are connected when the gap between them does not exceed a connection threshold \(L_{i}\), calculated by
$$\begin{aligned} L_{i} = \min (\nu _{i} T_{i}, L_\mathrm{max}) \end{aligned}$$
(4)
where \(L_\mathrm{max}\) is the maximum connection threshold value and \(\nu _{i}\) is a weight parameter. We set \(L_\mathrm{max}\) to 180 and \(\nu _{i}\) to 0.4 in this work; grid search was again used to pick these parameters.
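The per-phase smoothing step can be summarized in a short sketch. The following Python code is a minimal illustration of Eqs. (2)–(4), assuming one prediction per second and a boolean mask per phase; the function and variable names are ours, not the original implementation:

```python
def smooth_phase(preds, phase, t_i, eta=0.2, w_min=10, w_max=60,
                 mu=0.5, nu=0.4, l_max=180):
    """Smooth per-second predictions for one surgical phase.

    preds: list of per-second predicted labels for the full video.
    t_i:   minimum phase time T_i for this phase (seconds).
    Returns a boolean mask marking the seconds assigned to `phase`.
    """
    w = int(min(max(w_min, eta * t_i), w_max))  # Eq. (2): window size W_i
    j = mu * w                                  # Eq. (3): threshold J_i
    half = w // 2
    mask = [False] * len(preds)
    for t in range(len(preds)):
        lo, hi = max(0, t - half), min(len(preds), t + half + 1)
        # Label the middle time step as `phase` when the in-window
        # prediction frequency exceeds the threshold J_i.
        if sum(p == phase for p in preds[lo:hi]) > j:
            mask[t] = True
    l_i = min(nu * t_i, l_max)  # Eq. (4): connection threshold L_i
    last_true = None
    for t in range(len(preds)):
        if mask[t]:
            # Connect same-phase segments separated by a short gap.
            if last_true is not None and t - last_true - 1 <= l_i:
                for s in range(last_true + 1, t):
                    mask[s] = True
            last_true = t
    return mask
```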
For each surgical phase \(i\), we thus obtain smoothed prediction results. If prediction segments for different surgical phases overlap with each other, the prediction for the overlapping segment is determined by the average model confidence, calculated by
$$\begin{aligned} C_{i} = \frac{1}{f-e+1}\sum _{t=e}^{f}p_{(t,i)} \end{aligned}$$
(5)
where \(e\) is the start time step of the overlap segment, \(f\) is the end time step of the overlap segment, and \(p_{(t,i)}\) is the predicted probability of class \(i\) at time step \(t\) (\(e \le t \le f\)).
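As a concrete illustration, the sketch below computes Eq. (5) and uses it to resolve an overlap between two phase segments by keeping the more confident phase. The tie-breaking rule and all names are our assumptions, since the paper does not spell them out:

```python
def overlap_confidence(probs, phase, e, f):
    """Eq. (5): average model confidence C_i for `phase` over time
    steps e..f (inclusive), with one prediction per second.

    probs[t][phase] is the model's predicted probability for `phase`
    at time step t (e.g., a per-second dict of class probabilities).
    """
    return sum(probs[t][phase] for t in range(e, f + 1)) / (f - e + 1)

def resolve_overlap(probs, phase_a, phase_b, e, f):
    """Keep whichever phase has the higher average confidence on the
    overlap segment (illustrative tie-breaking: prefer phase_a)."""
    c_a = overlap_confidence(probs, phase_a, e, f)
    c_b = overlap_confidence(probs, phase_b, e, f)
    return phase_a if c_a >= c_b else phase_b
```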
With the smoothed prediction results, the phase time can be calculated for each surgical phase prediction segment. Although phase times vary across surgical phases, we can still correct prediction segments that are too short to be a surgical phase. One option is to use Eq. (5) to calculate the average model confidence of each label for those short segments and then reselect their labels accordingly. The limitation of this approach is that it cannot filter wrong prediction segments that are longer than the corresponding minimum phase time. In this work, instead of using the average model confidence, we replace those short segments with the “Not a phase” label.
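A minimal sketch of this short-segment replacement, assuming segments are (start, end, phase) triples in seconds and a mapping from each phase to its minimum phase time \(T_{i}\) (names are ours):

```python
NOT_A_PHASE = "Not a phase"

def filter_short_segments(segments, min_time):
    """Relabel prediction segments shorter than the minimum phase time.

    segments: list of (start_sec, end_sec, phase) triples.
    min_time: dict mapping each phase to its minimum phase time T_i.
    """
    return [(s, e, phase if e - s + 1 >= min_time[phase] else NOT_A_PHASE)
            for s, e, phase in segments]
```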
(3) Phase incidence: Although many surgical phases happen multiple times in one surgical video, some surgical phases normally happen only once or at most a fixed number of times. We calculate the set of maximum phase incidences \(I = \{I_{1}, I_{2}, \ldots , I_{N}\}\) from the annotation data of the training dataset, where \(N\) is the total number of phases. We correct prediction segments according to the phase incidence to further filter the prediction noise. For prediction segments that need correction, one option is to use Eq. (5) to calculate the average model confidence of each label and reselect labels accordingly; wrong prediction segments can be located on the validation dataset according to the set of maximum phase incidences \(I\), and the reselected labels can be evaluated against the ground truth annotations. In this work, instead of using the average model confidence, we replace those segments with the “Not a phase” label.
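A self-contained sketch of this incidence-based correction: at most \(I_{i}\) segments per phase are kept and the rest are relabeled as “Not a phase.” Ranking segments by their average confidence (Eq. (5)) to decide which to keep is our assumption, not a rule stated in the paper:

```python
NOT_A_PHASE = "Not a phase"

def filter_by_incidence(segments, probs, max_incidence):
    """Keep at most I_i segments per phase, relabel the overflow.

    segments: list of (start_sec, end_sec, phase) triples.
    probs: probs[t][phase] is the predicted probability of `phase`
           at second t (e.g., a per-second dict of class probabilities).
    max_incidence: dict mapping each phase to its maximum incidence I_i.
    """
    def avg_conf(s, e, phase):  # Eq. (5) averaged over one segment
        return sum(probs[t][phase] for t in range(s, e + 1)) / (e - s + 1)

    kept, counts = [], {}
    # Visit the most confident segments first so they are kept.
    for s, e, phase in sorted(segments, key=lambda g: avg_conf(*g), reverse=True):
        counts[phase] = counts.get(phase, 0) + 1
        over = counts[phase] > max_incidence[phase]
        kept.append((s, e, NOT_A_PHASE if over else phase))
    return sorted(kept)
```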