DL is a subfield of machine learning that excels at analyzing complex data, such as images, videos, and audio. DL leverages neural network architectures with multiple interconnected layers to process and comprehend intricate patterns within data. These architectures allow DL algorithms to automatically identify and extract various levels of information from the input data. This is achieved through a hierarchical representation of features, where lower layers capture rudimentary details while higher layers progressively assemble these details into more complex, abstract representations. For instance, when applied to image analysis, DL algorithms can discern basic features like edges, corners, and textures in the lower layers. As data flows through the network, subsequent layers consolidate these simple features into more elaborate elements like shapes, objects, and context. This layer-by-layer learning enables DL models to recognize complex structures within images without explicit programming. In this section, we discuss various DL architectures and their use in analyzing thyroid nodules in US images.
Supervised DL algorithms are trained on labeled data, in which the correct label or class for each image is known beforehand. Supervised learning is a powerful approach for thyroid ultrasound image analysis, providing accurate and consistent nodule detection and classification, personalized patient care, efficient screening, and the potential for early detection. It complements the expertise of healthcare professionals and enhances the quality of thyroid-related healthcare services.
Most frequently, supervised learning algorithms are preferred for ultrasound image analysis. Supervised DL algorithms discover the relationship between input data and desired output in order to make decisions on new, unseen data. A common application of supervised DL algorithms is classification, where the objective is to determine the class or category of an input image. However, collecting a large dataset of labeled images is difficult, making the implementation of these techniques challenging.
In contrast, unsupervised DL algorithms do not require labeled data and can learn to identify patterns and structures in the data without any prior knowledge. These algorithms are frequently used for clustering, where the objective is to group similar data samples without prior knowledge of the groups.
Hybrid DL architectures combine elements of supervised and unsupervised learning, enabling models with greater flexibility and versatility. These architectures can be trained on both labeled and unlabeled data and are applicable to a variety of tasks. Hybrid DL algorithms can be especially beneficial when labeled data is scarce or when the data is complex and difficult to classify accurately.
In this study, we focus mainly on supervised learning techniques. The remainder of this section discusses the various deep learning architectures employed for three fundamental tasks in the analysis of thyroid US images: classification, segmentation, and detection.
6.2 Segmentation
Thyroid nodule segmentation is an integral part of ultrasound-based diagnosis of thyroid diseases (Russ et al. 2017). In recent years, the segmentation of additional tissues related to the thyroid gland (Ma et al. 2022) has also been investigated. A comprehensive review (Chen et al. 2020) comparing 28 studies on CAD-based and DL systems traces the evolution of thyroid gland and nodule segmentation up to 2019; it highlights the similarities and contrasts between the approaches and discusses the benefits and drawbacks of each system.
U-Net (Chu et al. 2021) is a DL architecture specifically designed for image segmentation tasks. It consists of a CNN with an encoder-decoder structure: the encoder extracts features from the input image, and the decoder reconstructs a precise segmentation of the input image. U-Net has proven highly effective and precise for ultrasound (US) segmentation of thyroid nodules because it automatically learns and extracts relevant features from the data, such as the shape and location of nodules, without requiring prior knowledge or manual feature engineering (Chu et al. 2021).
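The encoder-decoder idea with skip connections can be illustrated with a minimal sketch on a 1-D signal. This is plain Python with no deep learning framework; a real U-Net applies learned convolutions at every level, whereas here each level merely pools (encoder) or upsamples and merges a skip connection (decoder), purely to show how spatial detail is preserved through the skips.

```python
def down(x):
    """Encoder step: halve resolution by max-pooling adjacent pairs."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x), 2)]

def up(x):
    """Decoder step: double resolution by nearest-neighbour upsampling."""
    return [v for v in x for _ in range(2)]

def unet_like(signal, depth=2):
    """Toy U-Net-shaped pass: contract, then expand while merging skips."""
    skips = []
    x = signal
    for _ in range(depth):              # contracting path
        skips.append(x)                 # remember high-resolution features
        x = down(x)
    for _ in range(depth):              # expanding path
        skip = skips.pop()
        x = [(u + s) / 2 for u, s in zip(up(x), skip)]  # merge skip connection
    return x

out = unet_like([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
print(len(out))  # output has the same resolution as the input: 8
```

The skip connections are what let the decoder recover precise nodule boundaries that pure downsampling would otherwise blur away.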
Building on this success, the Semantic Guided U-Net (SG-UNET) (Pan et al. 2021) was recently proposed. Its use of average pooling and leaky ReLU reduces noise and attenuates unfavorable filter responses. To reduce noise interference caused by the mirror structure of U-Net, a side network accepts high-dimensional features and transforms them into one-dimensional semantic features. The proposed architecture outperforms both U-Net and U-Net++. In addition to fully automated systems, there are semi-automated systems, such as mark-guided U-Net-based segmentation systems (Lu et al. 2022). The network proposed by Chu (Lu et al. 2022) achieved a segmentation precision of 0.9785.
Sun's TNSnet (Sun et al. 2022) is a dual network comprising two sub-networks: a form network and a regional network. The form network is responsible for determining an object's overall shape, while the regional network captures the object's finer details and characteristics. Compared with conventional single-network architectures, this dual-network design permits more precise and accurate object detection.
Recently, researchers used a two-stage network to detect medullary thyroid carcinoma, the second most prevalent (yet rare) form of thyroid cancer (Pan et al. 2022). The segmentation map is generated by a coarse-to-fine segmentation network (C2F-SegNet), a combination of CoarseNet and FineNet. A classifier based on prior knowledge then classifies the nodules; its backbone is a ResNet-34 pre-trained on ImageNet (Deng et al. 2009). The proposed network not only accurately identifies malignant nodules but also differentiates between the papillary and medullary forms of thyroid cancer. Its segmentation architecture achieves higher IoU and DSC than U-Net and U-Net++, although U-Net++ attains higher recall and precision. The classification architecture surpasses ResNest34 and ResNest50 in all aspects of evaluation.
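The IoU and DSC figures quoted throughout this section are computed directly from binary segmentation masks. A minimal sketch, using flat Python lists of 0/1 pixel labels (frameworks operate on tensors, but the formulas are identical):

```python
def iou(pred, target):
    """Intersection over Union (Jaccard index) of two binary masks."""
    inter = sum(p & t for p, t in zip(pred, target))
    union = sum(p | t for p, t in zip(pred, target))
    return inter / union if union else 1.0

def dice(pred, target):
    """Dice similarity coefficient (DSC) of two binary masks."""
    inter = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 2 * inter / total if total else 1.0

pred   = [1, 1, 1, 0, 0, 0]   # predicted nodule pixels (illustrative)
target = [0, 1, 1, 1, 0, 0]   # ground-truth nodule pixels
print(iou(pred, target))      # 2 / 4 = 0.5
print(dice(pred, target))     # 2*2 / 6 ≈ 0.667
```

Note that DSC is always at least as large as IoU for the same masks, which is worth remembering when comparing scores reported under different metrics.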
A weakly supervised model can help manage over- or under-segmentation of thyroid nodules (Yu et al. 2022). Semantic features are extracted using image-level classification labels. The nodule site is activated using a dual-branch soft erase module (DBSM) and a scale feature adaptation module (SFAM), and an edge self-attention module (ESAM) handles blurred edges.
For further details regarding thyroid gland and thyroid nodule segmentation methods for medical ultrasound images, see the existing comprehensive review (Chen et al. 2020).
6.3 Detection
The detection task determines the location and identity of anomalies (e.g., lesions and tumors) and other anatomical objects (e.g., fetal standard planes, organs, and tissues) in US image analysis. A new era in thyroid nodule detection began with the introduction of DL architectures such as region-based convolutional neural networks (R-CNN) (Girshick et al. 2014), Faster R-CNN (Ren et al. 2015), and You Only Look Once (YOLO) (Redmon et al. 2016).
Li et al. (2018) proposed an enhanced version of Faster R-CNN for the detection of thyroid nodules. A spatial constraint layer is added to extract characteristics of the surrounding region, and combining the shallow and deep layers of the network allows identification of small, hazy nodules. The standard Faster R-CNN with the layer concatenation technique was unable to detect solid nodules with uneven borders. Buda et al. (2019) suggest a three-part model. The first component is a Faster R-CNN with a ResNet101 backbone trained to recognize capillaries. The extracted square image of the nodule is then sent to a second network, a multitask deep convolutional neural network used to determine whether a nodule is malignant. The third component performs risk classification using the ACR TI-RADS output. The proposed network generates results comparable to the consensus of three radiologists, and the risk classification network improves the specificity of referrals for thyroid nodule biopsy.
Xie et al. (2019) proposed three fully convolutional networks based on the single-shot multi-box detector (SSD) (Liu et al. 2016), namely SSD300, SSD300 cov3, and SSD512. The base model is a typical image classification model with a fully connected layer, onto which a stack of convolutional layers is appended to extract additional features. These layers downsample the feature maps to generate features at a range of scales, and a convolutional predictor at the end of the network produces a class score for selected feature maps. The training loss combines smooth L1 loss and class-weighted cross-entropy loss. SSD300 was unable to identify tiny nodules and also missed certain dense nodules; SSD300 cov3 and SSD512 sometimes produced false positives among smaller nodules. The application of SSD showcases its suitability for capturing both the global and local features present in thyroid ultrasound images.
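The localization term of the SSD objective mentioned above is the smooth L1 (Huber-style) loss. A minimal sketch of that term, with illustrative box offsets; the full objective would add the class-weighted cross-entropy on the predicted scores:

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1 loss: quadratic near zero, linear for large errors."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta

def localization_loss(pred_box, true_box):
    """Sum of smooth L1 over the four box offsets (x, y, w, h)."""
    return sum(smooth_l1(p - t) for p, t in zip(pred_box, true_box))

print(smooth_l1(0.5))   # 0.125 (quadratic region: small errors damped)
print(smooth_l1(2.0))   # 1.5   (linear region: robust to outlier boxes)
print(localization_loss((0.1, 0.2, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)))
```

The quadratic region keeps gradients small for nearly correct boxes, while the linear region prevents a few badly mislocalized nodules from dominating training.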
Song et al. (2018) suggested a cascaded CNN architecture based on SSD and the multi-box framework to detect thyroid nodules. The detection layer of the SSD is reconstructed by adding several convolution layers and anchor-generation layers to extract local and global features. The two-stage network is designed to localize and classify nodules in a pyramidal structure: after the initial localization, the potential region of interest is fed into a spatial pyramid supplemented by CNNs to achieve adequate recognition of the thyroid. Although the architecture was trained on a large dataset, it was unable to detect excessively small or large nodules. To combat the lack of high-quality data, the authors proposed investigating transfer learning as future work.

Multitasking within deep learning models involves training a single architecture to perform multiple related tasks concurrently. This approach offers advantages such as efficient resource utilization, shared information across tasks, and regularization against overfitting. However, it introduces trade-offs that necessitate careful consideration: resources must be allocated so that task-specific performance is not compromised, increased model complexity can hinder interpretability and optimization, and task interference, stemming from dissimilarities between tasks or conflicting patterns, must be managed. Striking a balance between resource efficiency and task-specific performance is pivotal.
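In practice, the multitask trade-offs above come down to how the per-task losses are combined. A common sketch is a fixed weighted sum over tasks sharing one backbone; the loss values and weights below are illustrative, not figures from any cited work:

```python
def multitask_loss(task_losses, weights):
    """Weighted sum of per-task losses sharing a single backbone."""
    assert task_losses.keys() == weights.keys()
    return sum(weights[k] * task_losses[k] for k in task_losses)

# Hypothetical per-task loss values for one training batch
losses  = {"classification": 0.40, "localization": 1.20, "segmentation": 0.75}
# Hand-tuned task weights; poor choices here cause task interference
weights = {"classification": 1.0, "localization": 0.5, "segmentation": 0.5}

print(multitask_loss(losses, weights))  # 0.40 + 0.60 + 0.375 = 1.375
```

Tuning these weights is exactly the balancing act the paragraph describes: raising one task's weight improves that task at the potential expense of the others.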
Zhao et al. (2021) recently presented a two-part model. An SSD with a ResNet50 backbone is used for thyroid nodule detection, with the last few layers of ResNet50 replaced by a residual block and several additional blocks. To account for the varied aspect ratios of thyroid nodules, anchor boxes with ratios between 0.3 and 3 are generated. Nodules are cropped with a 50-pixel margin, and a 256 × 256 image is sent to the classification network; this is a reasonable preprocessing step, likely aiding in standardizing the input size. For classification, A-ResNet50-F, a modified version of ResNet50, is used. A-ResNet50-F adds an attention block and a fire block, implying a focus on capturing crucial nodule-specific features and optimizing the network for the classification task. The suggested architecture demonstrates higher average precision than SSD (VGG16 backbone), Faster R-CNN (ResNet50 backbone), and YOLO v3 (with Darknet53). For semi-solid nodules with irregular edges, the detection system produced a somewhat larger bounding box. Nevertheless, the classification network outperformed well-known networks such as VGG16 and Inception v3, and the proposed technique also beats expert radiologists at classification.

Mask R-CNN (He et al.
2017) is a two-stage network widely used for object detection and segmentation. The growing popularity of Mask R-CNN has opened a new direction in thyroid nodule detection (Abdolali et al. 2020). Mask R-CNN combines classification, segmentation, and localization losses, which has been shown to be advantageous for the detection of thyroid nodules; the combined loss function favors detection over segmentation. ResNet50 beat all other pre-trained backbone networks, including ResNet, U-Net, MobileNet, and Inception V2, owing to the smaller dataset size. Comparing Mask R-CNN against Faster R-CNN with the new loss function demonstrates a focus on both accuracy and efficiency. However, the network's suboptimal performance on primarily solid nodules reveals an area that requires attention, and the absence of thyroid parenchyma and microcystic nodules in the dataset is a relevant consideration for the generalizability of the results. The future direction outlined by the authors, reducing the complexity of Mask R-CNN and addressing overfitting, is a proactive approach to refining the model: simplifying the architecture and enhancing its generalization capabilities can contribute to robustness and usability in diverse clinical scenarios.

As a solution to inaccurate localization, Zheng et al. (2022b) propose a more efficient cascade Mask R-CNN. The structure is composed of two stages. In the first stage, an ROI is determined using an FPN (Lin et al. 2017) backbone and an RPN. The second stage is a multi-cascading detector network, which provides a more specific location for the nodule. A modified version of the L1 loss increases the gradient of the easy samples, balancing easy and difficult samples during training. Experimental results demonstrate that using more than three cascading networks degrades performance. A soft NMS is used to reduce the likelihood of an object being missed due to non-maximal suppression (NMS); the baseline with soft NMS alone did not improve localization, but a novel detector with five convolutional layers and one fully connected layer, together with L1 regularization, yields a considerable improvement. The authors claim that the proposed network is more precise than Mask R-CNN, Faster R-CNN, and Libra R-CNN (Pang et al. 2019) for medium and small nodules, although it has a lower true negative rate than Mask R-CNN.
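The soft NMS idea can be sketched in a few lines. Instead of discarding every box that overlaps a higher-scoring box (hard NMS), overlapping boxes merely have their scores decayed, which reduces the chance of missing an adjacent nodule. This linear-decay variant with illustrative thresholds is one common form; the cited work may use a different decay function:

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def soft_nms(boxes, scores, iou_thresh=0.5, score_thresh=0.001):
    """Linear soft NMS: decay overlapping scores instead of removing boxes."""
    dets = sorted(zip(boxes, scores), key=lambda d: -d[1])
    kept = []
    while dets:
        best, best_score = dets.pop(0)
        kept.append((best, best_score))
        rescored = []
        for box, score in dets:
            o = box_iou(best, box)
            if o > iou_thresh:
                score *= (1 - o)          # decay instead of discard
            if score > score_thresh:
                rescored.append((box, score))
        dets = sorted(rescored, key=lambda d: -d[1])
    return kept

boxes  = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(soft_nms(boxes, scores))  # all three boxes survive, one with a decayed score
```

Under hard NMS the second box would be removed outright; soft NMS keeps it with a reduced score, which is exactly the missed-detection behavior the cascade network exploits.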
Wang et al. (2019) proposed a modified version of YOLO to identify thyroid nodules with greater precision. YOLO is a one-stage model that reduces time complexity while enhancing precision, and detecting thyroid nodules requires precision more than speed. The model combines YOLO v2 with ResNet v2-50. Feature maps from the deep layers are blended with those from the shallow layers, capitalizing on the diverse information captured across different layers to generate more accurate and comprehensive feature maps. The suggested network is fast and is claimed to reach the same sensitivity, positive predictive value, and accuracy as a radiologist. The dataset is divided into two classes, with and without nodules, and the nodule class is further divided into benign and malignant; this structured setup likely contributes to the model's ability to differentiate between nodule types effectively. The network is effective in detecting images without nodules: in only two instances did it fail to recognize the absence of a nodule. It is also effective in localizing cancerous nodules; most malignant cases in the dataset are papillary thyroid carcinoma. The work does not explore whether radiologists' performance improves when assisted by the suggested network.
Redmon and Farhadi developed YOLO v3 (Redmon and Farhadi 2018), which is renowned for greater precision than its predecessors. Building on YOLO v3 as the base model, researchers developed YOLO-HRNet (Zhang et al. 2021). It consists of five distinct layers: input, downsampling, feature extraction, multi-scale detection, and prediction, with the downsampling layer used to scale the network's parameters. The incorporation of HRNet (Wang et al. 2020), known for extracting high-level semantic features, is a strategic choice to enhance the network's ability to capture complex visual characteristics, aligning with the objective of surpassing previous architectures. In all respects, YOLO-HRNet outperforms the YOLO v3 baseline. However, SSD beats YOLO-HRNet in terms of speed, while Faster R-CNN with a ResNet50 backbone outperforms it in terms of precision. The network generates larger bounding boxes for small hypoechoic nodules with fuzzy edges. The authors anticipate the availability of more labeled medical data and the use of thyroid antibodies for detection.
Song et al. (2022) proposed a feature-enhanced dual-branch network (FDnet) to detect thyroid nodules. The detection network includes a semantic segmentation network and a feature enhancement method, and more precise localization is achieved through an iterative training technique that blends the ground truth with the branch result. ResNet and an FPN serve as the backbone, a prudent choice given their proven effectiveness in feature extraction, and a region proposal network (RPN) is utilized to target the nodule region. An intriguing aspect of the approach is its pseudo-labeling system: the pseudo-labels generated in one epoch are used as the ground truth in the subsequent epoch, an innovative way of augmenting the training data. The proposed network exhibits strong performance even when trained on a small dataset, potentially making it practical for real-world medical applications, and it outperforms the most widely used R-CNN designs. CornerNet (Law and Deng 2018) and DETR (Zhu et al. 2020) surpass FDnet in mAP and F1 score, but require 240 and 150 training epochs, respectively. Beyond saving processing time, Song suggests the pseudo-labeling scheme as a basis for further unsupervised schemes.
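The pseudo-labeling loop described above can be sketched with a toy example. Here a trivial 1-D threshold "model" stands in for the detection network, and all data values are invented for illustration; the point is only the control flow, in which predictions from one epoch become the training targets of the next:

```python
def predict(threshold, xs):
    """Toy model: label a point positive if it exceeds the threshold."""
    return [1 if x >= threshold else 0 for x in xs]

def fit_threshold(xs, ys):
    """Toy training rule: midpoint between the two labeled classes."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (min(pos) + max(neg)) / 2 if pos and neg else 0.0

labeled_x, labeled_y = [0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]
unlabeled_x = [0.3, 0.4, 0.6, 0.7]

threshold = fit_threshold(labeled_x, labeled_y)    # train on labeled data only
for epoch in range(3):
    pseudo_y = predict(threshold, unlabeled_x)     # pseudo-labels this epoch
    # next epoch treats the pseudo-labels as ground truth
    threshold = fit_threshold(labeled_x + unlabeled_x, labeled_y + pseudo_y)
print(threshold)
```

The same loop structure applies when the "model" is a detection network and the pseudo-labels are predicted bounding boxes or masks.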
Shahroudnejad et al. (2021) developed a one-stage FPN model, TUD-Net, to gather multiscale characteristics from multiple-resolution feature maps. The architecture consists of three parallel layers of RSU designed to replace a CNN for classification and regression tasks. The model successfully recognizes heterogeneous, large, and hypoechoic nodules, outperforms models such as RetinaNet and Faster R-CNN in average precision, and performs well under different IoU thresholds.
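Evaluating a detector "under different IoU thresholds" means matching its predicted boxes to ground-truth boxes at each threshold and counting true positives, the basis of average-precision curves. A minimal sketch with invented boxes:

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def true_positives(detections, ground_truth, iou_thresh):
    """Greedy one-to-one matching of detections to ground-truth boxes."""
    unmatched = list(ground_truth)
    tp = 0
    for det in detections:                      # assumed sorted by score
        best = max(unmatched, key=lambda g: box_iou(det, g), default=None)
        if best is not None and box_iou(det, best) >= iou_thresh:
            unmatched.remove(best)              # each ground truth matched once
            tp += 1
    return tp

dets = [(0, 0, 10, 10), (18, 18, 30, 30)]
gts  = [(1, 1, 10, 10), (20, 20, 30, 30)]
print(true_positives(dets, gts, 0.5))   # both detections count at IoU 0.5
print(true_positives(dets, gts, 0.75))  # the looser second box drops out
```

Raising the threshold demands tighter localization, which is why a model that "performs well under different IoU thresholds" is producing precise boxes, not just roughly correct ones.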
Lu et al. (2022) recently applied a GAN-guided, CAM-based technique to the diagnosis of thyroid nodules. The architecture includes class activation mapping (CAM) (Zhou et al. 2016) to identify discriminatory features in thyroid nodules, and a GAN-guided (Yi et al. 2019) deformable module to capture finer-grained distinctions between benign and malignant nodules. CAM supplies the saliency map to the outer layers of the deformable network, also known as deformable convolution layers; together, the CAM and the deformable module successfully detect the subtle differences between nodules. Real samples are generated by augmenting the ground-truth mask using prior knowledge. However, the number of augmentation techniques used is limited and can easily be captured by the GAN; a simple augmentation such as zooming in on the nodule can deceive the discriminator. If, for a particular application, the nodule shape is not relevant for diagnosis, or the boundaries are regular, the GAN may prevent the deformable module from capturing meaningless features from the images.
Integrating the CAM mechanism to identify discriminatory features in thyroid nodules is a strategic step. The CAM's ability to generate saliency maps contributes to a focused understanding of relevant areas within the nodule images. A notable advancement is the subsequent collaboration with a GAN-guided deformable module to capture nuanced distinctions between benign and malignant nodules. By incorporating the deformable convolution layers with the saliency information from the CAM, the model seems well-equipped to discern subtle differences between nodules. The inclusion of real sample generation through augmented ground truth masks, based on prior knowledge, highlights a data augmentation strategy. This approach likely contributes to a more diverse training dataset, which can be particularly useful in addressing potential overfitting and enhancing the model's generalization capabilities.
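The CAM computation referenced above amounts to a weighted sum of the last convolutional feature maps, with weights taken from the classifier for the class of interest. A minimal sketch using tiny hand-made 2 × 2 grids and hypothetical class weights, purely for illustration:

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of per-channel feature maps into one saliency map."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, weight in zip(feature_maps, class_weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += weight * fmap[i][j]
    return cam

# Two 2x2 feature maps (channels) and hypothetical classifier weights
fmaps = [
    [[1.0, 0.0], [0.0, 0.0]],   # channel firing on the top-left region
    [[0.0, 0.0], [0.0, 1.0]],   # channel firing on the bottom-right region
]
weights = [0.9, 0.1]            # invented weights for one output class

print(class_activation_map(fmaps, weights))  # top-left dominates the saliency map
```

High values in the resulting map mark the image regions that drove the class decision, which is the saliency signal fed to the deformable convolution layers.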
Based on the research work cited in this review, it is evident that standardized evaluation metrics are needed. DL networks using the same base model have been evaluated with inconsistent performance metrics; for example, see (Xie et al. 2019; Song et al. 2018). It is very difficult to compare networks by looking at the reported results alone. In addition, a standardized dataset for testing all the networks is needed.