Multi-task encoder
Let \(\mathbf {x} \in \{0,\dots ,255\}^{W, H, 3}\) be an RGB image with width W, height H and 3 colour channels. Let \(\mathbb {E}(\cdot ): \mathbf {x} \rightarrow (\hat{\mathbf {S}}^{S, W, H}, \hat{\mathbf {P}}^{P})\) be the proposed multi-task encoder, composed of two branches, that jointly estimates the scene segmentation of surgical instruments and anatomy, \(\hat{\mathbf {S}}^{S, W, H}\), and the surgical phase, \(\hat{\mathbf {P}}^{P}\), where S and P are the number of scene and phase classes, respectively.
A simplified diagram of the proposed multi-task encoder architecture is depicted in Fig. 1. The proposed encoder is composed of a shared backbone (i.e., ResNet50 without the last residual block), \(\mathbb {B}(\cdot ): \mathbf {x} \rightarrow \mathbf {f}_B\), that, given an image \(\mathbf {x}\), generates task-agnostic high-level features \(\mathbf {f}_B\). The features generated by the backbone, \(\mathbf {f}_B\), are then fed to two branches, namely the scene segmentation branch and the phase branch.
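For illustration, a minimal PyTorch sketch of the shared backbone follows, assuming the torchvision ResNet50 implementation (its module names, such as layer1 to layer4, are real; the class and variable names are ours):

```python
import torch.nn as nn
from torchvision.models import resnet50

class SharedBackbone(nn.Module):
    """B(.): ResNet50 without its last residual block (layer4)."""
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        # layer4 (the last residual block) is excluded here; it is reused
        # as the task-specific scene and phase heads.
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layer1, self.layer2, self.layer3 = r.layer1, r.layer2, r.layer3

    def forward(self, x):            # x: (B, 3, H, W) RGB frames
        f = self.stem(x)
        f = self.layer1(f)
        f = self.layer2(f)
        return self.layer3(f)        # f_B: task-agnostic features, (B, 1024, H/16, W/16)
```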
Scene segmentation branch. The scene segmentation branch is composed of the last residual block of the encoder, namely the scene head \(\mathbb {S}(\cdot ): \mathbf {f}_B \rightarrow \mathbf {f}_S\), that generates scene-specific features \(\mathbf {f}_S\); and a segmentation module, \(\mathbb {T}(\cdot ): \mathbf {f}_S \rightarrow \hat{\mathbf {S}}^{S, W, H}\), that estimates the pixel-wise semantic segmentation of the frame. The segmentation module first performs a bilinear interpolation, \(\mathbb {U}_1(\cdot )\), that upscales the spatial dimension of the features four times; it then applies a 3-by-3 convolution, \(\mathbb {C}_{3\times 3}(\cdot )\), and a batch-norm layer, \(BN(\cdot )\), reducing the number of channels by four, from 2048 to 512. After that, a rectified linear unit, \(ReLU(\cdot )\), is applied, followed by a final 1-by-1 convolution, \(\mathbb {C}_{1\times 1}(\cdot )\), with S output channels (one per scene class), and a bilinear interpolation, \(\mathbb {U}_2(\cdot )\), that upscales the estimated segmentation mask to the original frame resolution. We formulate the learning of this branch as a multi-class problem, trained with a cross-entropy loss after a Softmax activation function, \(Softmax(\cdot )\). In summary, the estimated segmentation is computed as
$$\begin{aligned} \hat{\mathbf {S}}^{S, W, H} = Softmax\,(\,\mathbb {U}_2\,(\,\mathbb {C}_{1\times 1}\,(\,ReLU\,(\,BN\,(\,\mathbb {C}_{3\times 3}\,(\,\mathbb {U}_1\,(\,\mathbb {S}\,(\,\mathbf {f}_B\,)\,)\,)\,)\,)\,)\,)\,), \end{aligned}$$
(1)
and learnt using the following loss function
\(\mathcal {L}_S = CE(\mathbf {S}^{S, W, H}, \hat{\mathbf {S}}^{S, W, H})\), where CE is the cross-entropy loss and
\(\mathbf {S}^{S, W, H}\) is the scene segmentation annotation. Segmentation annotations are expensive to generate; therefore, we consider the scenario where only a small number of frames have such annotations. While the scene branch is computed for all frames, since its features are used by the phase branch, we only backpropagate the above loss for the frames where the scene annotation is available. Non-annotated frames do not contribute to the scene loss.
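The sketch below illustrates the segmentation module of Eq. (1) and the masked scene loss in PyTorch; shapes follow the text, the class and function names are ours, and F.cross_entropy applies the Softmax internally:

```python
import torch.nn as nn
import torch.nn.functional as F

class SegmentationModule(nn.Module):
    """T(.) of Eq. (1): U_2(C_1x1(ReLU(BN(C_3x3(U_1(f_S))))))."""
    def __init__(self, num_scene_classes: int):
        super().__init__()
        self.conv3x3 = nn.Conv2d(2048, 512, kernel_size=3, padding=1)    # channels 2048 -> 512
        self.bn = nn.BatchNorm2d(512)
        self.conv1x1 = nn.Conv2d(512, num_scene_classes, kernel_size=1)  # S output channels

    def forward(self, f_s, frame_size):
        z = F.interpolate(f_s, scale_factor=4, mode="bilinear", align_corners=False)    # U_1
        z = self.conv1x1(F.relu(self.bn(self.conv3x3(z))))
        return F.interpolate(z, size=frame_size, mode="bilinear", align_corners=False)  # U_2

def scene_loss(logits, target, annotated):
    """CE on annotated frames only; target: (B, H, W) long, annotated: (B,) bool."""
    if annotated.any():
        return F.cross_entropy(logits[annotated], target[annotated])
    return logits.new_zeros(())  # no annotated frame: no contribution to the scene loss
```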
Phase branch. The phase branch is composed of the last residual block of the encoder, namely the phase head \({\mathbb {P}(\cdot ): \mathbf {f}_B \rightarrow \mathbf {f}_P}\), that generates phase-specific features, \(\mathbf {f}_P\); a fusion module, \(\mathbb {F}(\cdot )\), that combines the task-specific features generated by the two branches; a global average pooling, \(GAP(\cdot )\); and a fully connected layer, \(FC(\cdot )\). As fusion module we use the fast normalised fusion [14], a simple and lightweight module that fuses features effectively while providing good performance and fast, stable training. The fusion module, \(\mathbb {F}(\cdot ): (\mathbf {f}_S,\mathbf {f}_P) \rightarrow \mathbf {f}\), learns to combine the task-specific scene and phase features into a fused feature, \(\mathbf {f}\), as:
$$\begin{aligned} \mathbf {f} = \frac{ReLU(\alpha _S)}{\sum _{\forall i}{ReLU(\alpha _i)} + \epsilon }\,\mathbf {f}_S + \frac{ReLU(\alpha _P)}{\sum _{\forall i}{ReLU(\alpha _i)} + \epsilon }\,\mathbf {f}_P, \end{aligned}$$
(2)
where \(\alpha _S\) and \(\alpha _P\) are learnable weights, and \(\epsilon =0.0001\) is a small scalar for numerical stability. We formulate the learning of this branch as a multi-class problem, trained with a cross-entropy loss after a Softmax activation function. In summary, the estimated phase is computed as:
$$\begin{aligned} \hat{\mathbf {P}}^{P} = Softmax\,(\,FC\,(\,GAP\,(\,\mathbf {f}\,)\,)\,), \end{aligned}$$
(3)
and learnt using the following loss function
\(\mathcal {L}_P = CE(\mathbf {P}^{P}, \hat{\mathbf {P}}^{P})\), where
\({\mathbf {P}^{P}}\) is the phase annotation.
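A minimal PyTorch sketch of the fast normalised fusion of Eq. (2) and the phase classification of Eq. (3) follows; the class names and the 2048-channel default are illustrative, and the Softmax is again left to the cross-entropy loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FastNormalizedFusion(nn.Module):
    """Eq. (2): a ReLU-normalised weighted sum of the two task features."""
    def __init__(self, eps: float = 1e-4):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(2))  # alpha_S, alpha_P
        self.eps = eps

    def forward(self, f_s, f_p):
        w = F.relu(self.alpha)
        w = w / (w.sum() + self.eps)              # normalisation of Eq. (2)
        return w[0] * f_s + w[1] * f_p            # fused feature f

class PhaseClassifier(nn.Module):
    """Eq. (3): FC(GAP(f)); returns logits, Softmax is applied in the loss."""
    def __init__(self, num_phase_classes: int, channels: int = 2048):
        super().__init__()
        self.fusion = FastNormalizedFusion()
        self.fc = nn.Linear(channels, num_phase_classes)  # FC of Eq. (3)

    def forward(self, f_s, f_p):
        f = self.fusion(f_s, f_p)
        g = F.adaptive_avg_pool2d(f, 1).flatten(1)  # GAP: (B, C, h, w) -> (B, C)
        return self.fc(g)                           # phase logits
```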
In summary, the multi-task encoder is trained with the combined loss \(\mathcal {L} = \mathcal {L}_S + \mathcal {L}_P\). Once the multi-task encoder is trained, we freeze its weights and extract features for every frame from Eq. (3), after discarding the fully connected layer and the activation function.
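For illustration, the per-frame feature extraction amounts to keeping \(GAP(\mathbf {f})\); the sketch below assumes a hypothetical fused_features accessor on the trained encoder, since the paper does not specify one:

```python
import torch

@torch.no_grad()
def extract_features(encoder, frames):
    """frames: (T, 3, H, W) video; returns (T, C) per-frame features GAP(f)."""
    encoder.eval()                          # frozen weights, inference mode
    f = encoder.fused_features(frames)      # hypothetical accessor for f: (T, C, h, w)
    return f.mean(dim=(2, 3))               # global average pooling over space
```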
Multi-stage temporal convolutional network
The majority of the literature relies on recurrent neural networks, which are inefficient and slow at capturing very long-term temporal patterns, as they are often trained using a sliding-window approach. Instead, we use a dilated causal multi-stage TCN [15] as our temporal model, as it has been shown to provide accurate, lightweight, and fast surgical phase estimation [12]. Its large temporal receptive field captures the full temporal resolution with a reduced number of parameters, allowing for faster training and inference and leveraging untrimmed surgical videos. Specifically, we use a two-stage causal TCN, \(TCN(\cdot ): \mathbf {f} \rightarrow \hat{\mathbf {P}}_T^{P}\), that learns to leverage the temporal relationships of the multi-task fused features generated by the encoder, \(\mathbf {f}\), to estimate the final phase predictions, \(\hat{\mathbf {P}}_T^{P}\). The TCN is constructed solely with causal temporal convolutional layers, avoiding pooling or fully connected layers so that the feature maps keep a fixed dimension. Unlike [5], we propose to train the TCN using a cross-entropy loss and a truncated mean squared error in the temporal domain [15] as:
$$\begin{aligned} \mathcal {L}_T = CE\,(\,\mathbf {P}^{P}, \hat{\mathbf {P}}_T^{P}\,) + C_0^c\,(\,\mathbf {P}^{P} - \hat{\mathbf {P}}_T^{P}\,)^2, \end{aligned}$$
(4)
where \(C_0^c(\cdot )\) is the clamp operator, c is the maximum clamping value, and \(\mathbf {P}^{P}\) is the phase annotation. The mean squared error term helps the temporal model obtain smoother predictions in the time domain.
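A minimal sketch of the loss in Eq. (4) and of a causal dilated convolutional layer follows; the clamp value c, the one-hot encoding of the annotation, and all names are illustrative assumptions, and, following the form of Eq. (4), the difference is clamped before squaring:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def temporal_loss(logits, target, c: float = 4.0):
    """Eq. (4): CE plus a truncated MSE. logits: (T, P); target: (T,) long."""
    ce = F.cross_entropy(logits, target)
    onehot = F.one_hot(target, num_classes=logits.shape[1]).float()
    diff = (F.softmax(logits, dim=1) - onehot).abs().clamp(max=c)  # C_0^c
    return ce + (diff ** 2).mean()

class CausalDilatedConv(nn.Module):
    """A causal dilated 1D convolution: the output at time t sees only t' <= t."""
    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.pad = 2 * dilation  # (kernel_size - 1) * dilation, applied on the left only
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation)

    def forward(self, x):        # x: (B, C, T) per-frame features over time
        return self.conv(F.pad(x, (self.pad, 0)))
```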