Introduction
- Endoscopy: The surgeon performs the procedure with the endoscope itself. The endoscope enters through the body's natural openings, so the surgeon makes no incisions.
- Robot-assisted surgery (robotic surgery): The surgeon makes several small incisions to guide the endoscope and robotic tools into the body, then controls the surgery while seated at a nearby computer console.
- Laparoscopy: The surgeon uses a laparoscope, a thin tube fitted with a light and a camera, together with several other small surgical instruments.
Related works
Background and methods
Machine learning and deep learning
- Feature extraction (Fig. 2): A machine learning workflow starts with relevant features being manually extracted from images. The features are then used to create a model that classifies the objects in the image. In a deep learning workflow, by contrast, relevant features are extracted from images automatically.
- Data size matters: Deep learning algorithms scale with data, which means they often continue to improve as the size of the dataset grows. In contrast, most shallow (machine learning) methods plateau at a certain level of performance as more examples and training data are added.
- Input data: Machine learning algorithms almost always require structured data, while deep learning networks rely on layers of artificial neural networks (ANNs) and can work directly with raw data such as images.
Computer vision and convolutional neural networks
- The hidden layers/feature extraction part: In this part, the network performs a series of convolution and pooling operations during which the features are detected.
- The classification part: Here, fully connected layers serve as a classifier on top of the extracted features, assigning a probability to each predicted object in the image.
- Input layer: Convolutional neural networks take advantage of the fact that the input consists of images, and they constrain the architecture accordingly. In particular, unlike a regular neural network, the layers of a ConvNet have neurons arranged in 3 dimensions: width, height, and depth, where depth is generally the number of color channels (RGB = 3, CMYK = 4, HSV = 3).
- Convolutional layer: the core building block of a CNN. It generates a feature map, also called an activation map (a highlight of the relevant features of the image), using a feature detector. A feature detector, also known as a kernel or filter, moves across the receptive field of the image, checking whether the feature is present. The first layers detect basic features such as horizontal and vertical edges, while later layers extract more complex features. This process is known as convolution. The feature detector is a two-dimensional (2-D) array of weights representing part of the image. While filters can vary in size, a 3 × 3 matrix is typical. The most common activation function used in this layer is ReLU.
- Pooling layer: performs dimensionality reduction by reducing the number of parameters in the input. The kernel applies an aggregation function to the values within its receptive field, populating the output array. There are two main types of pooling: max pooling selects the maximum value within the receptive field to send to the output array, while average pooling sends the average value.
- Fully connected layer: each node in the output layer connects directly to every node in the previous layer. This layer performs classification based on the features extracted by the previous layers and their filters. The classification layer outputs a set of confidence scores, typically using a softmax or sigmoid activation function, that specify how likely the image is to belong to each class.
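The convolution, ReLU, and max-pooling operations described above can be sketched in a few lines of NumPy. The 6 × 6 "image" and the hand-written vertical-edge filter below are purely illustrative, not part of the paper's pipeline:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid 2-D convolution: slide the kernel over the image with stride 1."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(feature_map, size=2, stride=2):
    """Max pooling: keep the maximum value in each window."""
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out

# Toy 6x6 image with a bright vertical stripe in column 3.
image = np.zeros((6, 6))
image[:, 3] = 1.0

# 3x3 vertical-edge detector (a Sobel-like filter).
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

activation = relu(convolve2d(image, kernel))  # 4x4 feature map
pooled = max_pool(activation)                 # 2x2 after 2x2 max pooling
```

The feature map lights up only where the filter's pattern (a left-to-right dark-to-bright transition) matches the stripe, which is exactly the "checking if the feature is present" behavior described above.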
Training a convolutional neural network
- Training from scratch: refers to building a deep network such as a CNN and learning both the features and the model. Training from scratch requires a very large labeled dataset. This is a less common approach because, given the amount of data involved, such networks typically take days or weeks to train.
- Transfer learning [28]: a technique in which the weights of an already trained (pre-trained) model are transferred to a different problem on a different dataset. In practice, it means exploiting a pre-trained model as a feature extractor, taking advantage of features learned on a larger dataset in the same domain. This is done by removing the last fully connected layer and instantiating a fresh fully connected layer whose output size matches the number of classes in the new problem. The pre-trained model is "frozen" and only the weights of the classifier are updated during training. In this case, the convolutional base extracts the features associated with each image, and only a classifier that determines the image class from those extracted features is trained.
- Fine-tuning: In this approach, we not only replace and retrain the classifier on top of the ConvNet on the new dataset, but also fine-tune the weights of the pre-trained network by continuing backpropagation. It is possible to fine-tune all the layers of the ConvNet, or to keep some of the earlier layers fixed (due to overfitting concerns) and fine-tune only a higher-level portion of the network. This is motivated by the observation that the earlier layers of a ConvNet contain more generic features (e.g. edge or color-blob detectors) useful for many tasks, while later layers become progressively more specific to the classes in the original dataset.
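The transfer-learning recipe above (freeze the base, train only a fresh classifier head) can be illustrated with a deliberately tiny NumPy stand-in. The fixed random projection plays the role of a frozen pre-trained convolutional base, and the data is synthetic; none of this is the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen stand-in for a pre-trained convolutional base. A real base
# (e.g. VGG19 without its top layers) maps an image to a feature vector;
# here a fixed random projection followed by ReLU plays that role.
W_base = rng.normal(size=(64, 16))              # frozen; never updated

def frozen_base(x):
    return np.maximum(x @ W_base, 0.0)

# Synthetic "images" (flattened to 64 values) in 3 classes; each class
# injects a strong signal into one input dimension.
n, n_classes = 300, 3
X = rng.normal(size=(n, 64))
y = rng.integers(0, n_classes, size=n)
X[np.arange(n), y] += 6.0

# Extract features once -- the base is frozen, so they never change.
feats = frozen_base(X)
feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

# Fresh fully connected classifier head: the only part that is trained.
W_head = np.zeros((16, n_classes))

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

onehot = np.eye(n_classes)[y]
for _ in range(300):                            # plain gradient descent
    probs = softmax(feats @ W_head)
    grad = feats.T @ (probs - onehot) / n
    W_head -= 0.1 * grad                        # only W_head is updated

accuracy = (softmax(feats @ W_head).argmax(axis=1) == y).mean()
```

The key property is visible in the loop: `W_base` is never touched, so the classifier learns to map a fixed set of extracted features to class labels, exactly as described for the frozen-base approach.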
VGG19
- A fixed input size of 224 × 224. For RGB images, the network input becomes 224 × 224 × 3, where 224 is the height and width and 3 is the number of RGB channels.
- 3 × 3 kernels with a stride of 1 pixel.
- 2 × 2 max pooling windows with a stride of 2.
- A rectified linear unit (ReLU) as the non-linearity (previous models used tanh or sigmoid).
- Three fully connected layers, where the first two have a size of 4096 and the last has 1000 channels, the number of classes in the ImageNet dataset.
- A softmax function in the final layer.
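The numbers above can be checked by tracing the spatial size of the feature maps through the standard VGG19 stack: each 3 × 3 convolution with stride 1 and 1 px of padding leaves the size unchanged, and each 2 × 2 max pool with stride 2 halves it. A short sketch:

```python
# Standard VGG19 layout: (number of 3x3 convs, output channels) per
# block, with one 2x2 max pool after each block.
blocks = [(2, 64), (2, 128), (4, 256), (4, 512), (4, 512)]

size = 224  # fixed input height/width
for n_convs, _channels in blocks:
    for _ in range(n_convs):
        size = (size - 3 + 2 * 1) // 1 + 1   # 3x3 conv, stride 1, pad 1
    size = (size - 2) // 2 + 1               # 2x2 max pool, stride 2

# After five pooling stages: 224 -> 112 -> 56 -> 28 -> 14 -> 7.
# The final 7 x 7 x 512 feature map is flattened and fed to the first
# fully connected layer of size 4096.
flattened = size * size * 512
```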
Inception v-4
NASNet-A
Dataset
Experiments and results
Data pre-processing
Removing irrelevant frames
Splitting videos and resizing frames
Data augmentation
- Rotation: Minority-class images are rotated by 0°, 40°, 85°, 125°, 250°, and 300°.
- Mirroring: The image is mirrored along the x-axis and the y-axis.
- Shearing: The images are sheared by 40° in the counter-clockwise direction.
- Padding: 5 px of padding on each border, using reflect mode, which pads with the reflection of the image without repeating the edge value.
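Mirroring and reflect-mode padding can be reproduced directly in NumPy; the 3 × 3 single-channel "image" and the 1 px pad width below are illustrative choices so the result stays readable (the paper pads 5 px). Arbitrary-angle rotation and shearing would additionally need an imaging library such as Pillow or SciPy:

```python
import numpy as np

# A tiny 3x3 single-channel image to make the effects easy to inspect.
img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

# Mirroring: flip along the y-axis (reverse columns) and the x-axis
# (reverse rows).
flip_h = np.flip(img, axis=1)
flip_v = np.flip(img, axis=0)

# Reflect-mode padding: the border is mirrored without repeating the
# edge value itself, which is what np.pad's 'reflect' mode does
# ('symmetric' is the variant that would repeat the edge value).
padded = np.pad(img, pad_width=1, mode='reflect')
```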
Experimental setup
- Scenario 1: The models VGG19, Inception v-4, and NASNet-A are each fine-tuned and used as a classifier.
- Scenario 2: The three networks are combined using ensemble learning.
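One simple way to combine the three networks is soft voting: average their per-class probabilities and take the most likely class per frame. The paper does not spell out its combination rule, so the unweighted averaging and the probability values below are illustrative assumptions only:

```python
import numpy as np

# Per-model class probabilities for a batch of 2 frames over 3 tool
# classes (made-up numbers, not the models' actual outputs).
vgg19_probs     = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
inception_probs = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
nasnet_probs    = np.array([[0.5, 0.4, 0.1], [0.1, 0.6, 0.3]])

# Soft-voting ensemble: average the probability vectors, then pick the
# most likely class for each frame.
ensemble_probs = (vgg19_probs + inception_probs + nasnet_probs) / 3
predictions = ensemble_probs.argmax(axis=1)
```

Averaging tends to smooth out the individual models' mistakes, which is consistent with the ensemble column outperforming each single network in the results tables.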
Results and discussion
Tool | VGG19 | Inception v-4 | NASNet-A | Ensemble learning |
---|---|---|---|---|
Grasper | 97.89 | 96.54 | 97.32 | 97.70 |
Bipolar | 96.72 | 94.33 | 97.11 | 98.14 |
Hook | 99.83 | 99.70 | 99.89 | 99.91 |
Scissors | 87.59 | 80.84 | 90.06 | 94.54 |
Clipper | 97.65 | 93.67 | 98.54 | 99.51 |
Irrigator | 96.10 | 92.08 | 95.91 | 97.79 |
SpecimenBag | 95.21 | 93.94 | 96.35 | 97.29 |
Average (mAP) | 95.85 | 93.01 | 96.45 | 97.84 |
Tool | VGG19 | Inception v-4 | NASNet-A | Ensemble learning |
---|---|---|---|---|
Grasper | 96.45 | 95.16 | 96.31 | 96.88 |
Bipolar | 99.67 | 99.17 | 99.53 | 99.81 |
Hook | 99.90 | 99.79 | 99.91 | 99.93 |
Scissors | 98.76 | 98.11 | 99.18 | 99.64 |
Clipper | 99.82 | 99.37 | 99.89 | 99.95 |
Irrigator | 99.41 | 98.74 | 99.41 | 99.85 |
SpecimenBag | 99.62 | 99.39 | 99.56 | 99.81 |
Average (\(mA_{z}\)) | 99.09 | 98.58 | 99.11 | 99.41 |
Tool | M.Sahu [33] | EndoNet [32] | Amy.J [34] | Jo [35] | Lin [36] | Shi [12] | Our model |
---|---|---|---|---|---|---|---|
Grasper | 73.9 | 84.8 | 87.2 | 92.1 | 85.41 | 89.88 | 97.70 |
Bipolar | 40.8 | 86.9 | 75.1 | 82.3 | 90.36 | 90.52 | 98.14 |
Hook | 95.1 | 95.6 | 95.3 | 85.9 | 90.84 | 99.33 | 99.91 |
Scissors | 26.2 | 58.6 | 70.8 | 81.2 | 90.58 | 90.78 | 94.54 |
Clipper | 35.3 | 80.1 | 88.4 | 85.3 | 90.05 | 90.19 | 99.51 |
Irrigator | 33.2 | 74.4 | 73.5 | 82.9 | 87.42 | 89.62 | 97.79 |
SpecimenBag | 76.6 | 86.8 | 82.1 | 83.2 | 89.98 | 91.25 | 97.29 |
Average (mAP) | 54.44 | 81.0 | 81.8 | 84.7 | 89.23 | 91.65 | 97.84 |