Published in: Neural Computing and Applications 11/2024

Open Access 13-01-2024 | Original Article

CellSegUNet: an improved deep segmentation model for the cell segmentation based on UNet++ and residual UNet models

Author: Sedat Metlek



Abstract

Cell nucleus segmentation is an important method that is widely used in the diagnosis and treatment of many diseases, as well as in counting and identifying cell nuclei. The main challenges in this task are heterogeneous image intensities, overlapping cell nuclei, and noise. To overcome these difficulties, a hybrid segmentation model with an attention block, CellSegUNet, is proposed, inspired by the advantageous points of the UNet++ and Residual UNet models. With the proposed attention mechanism, potential semantic gaps are prevented by evaluating horizontal and vertical features together. The serial and parallel connection of the convolutional blocks in the residual modules of the CellSegUNet model prevents data loss. Thus, features with stronger representation ability are obtained. The output layer, proposed specifically for the CellSegUNet model, calculates the difference between the data at each layer level and the data in the input layer. The output of the layer level that yields the lowest difference value constitutes the output of the whole system. At the same depth level, CellSegUNet was compared against the UNet++ and ResUNet models on the Data Science Bowl (DSB), Sartorius Cell Instance Segmentation (SCIS), and Blood Cell Segmentation (BCS) datasets. With the CellSegUNet model, accuracy, dice, and jaccard metrics were obtained as 0.980, 0.970, and 0.959 for the DSB dataset; 0.931, 0.957, and 0.829 for the SCIS dataset; and 0.976, 0.971, and 0.927 for the BCS dataset, respectively. As a result, the proposed model is expected to provide solutions to different segmentation problems.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

In the field of medicine, a correct diagnosis of many diseases can be made using pathological images. Cell nuclei in pathological images are examined in the diagnosis of many diseases, such as Alzheimer's and diabetes. In the literature, many different diseases can be diagnosed by examining only the cell nucleus [1]. This is why cell segmentation is a very important issue in the field of medicine. The study of cell nuclei is also important in the pharmaceutical industry. However, there are some difficulties in the segmentation of the cell nucleus. These include heterogeneous image intensities, overlapping cell nuclei, and noise in the image [2]. To overcome these difficulties, many deep learning-based image-processing algorithms have been developed in the literature [3].
When the literature on cell segmentation is examined, UNet-based segmentation algorithms are the most popular and successful of the methods used [4–6]. The UNet model is built on encoder-decoder blocks. This structure is used in segmentation models and modern semantic architectures across many different areas [7, 8].
To understand the success of the UNet model, the operations performed in the encoder block should be examined first. In this block, the image is passed through convolutional operations and transferred to subnetworks, yielding deep, semantic, coarse-grained feature maps. The second factor that makes the UNet model successful is the decoder block. In this structure, a new image with the same dimensions as the input image is obtained by combining the upsampled features with the low-level, fine-grained feature maps obtained from the encoder block. In this architecture, images obtained from subnetworks can be effective in determining the details of objects. In addition, skip connections and stride operations are applied to the same image, which directly affects segmentation success [9]. There are, however, some gaps that can cause problems in the encoder-decoder blocks that form the basis of the UNet model. These gaps are briefly summarized as follows:
  • The basic UNet model may give poor results on cell nuclei images that have somewhat complex backgrounds or overlap. Therefore, there is a need for methods that can provide high segmentation success in complex images [9].
  • Because the UNet model can be used in many different applications, it can be run at varying depths on datasets of varying sizes. As a result, the depth of the encoder-decoder architecture may differ from application to application. Although high success is achieved, especially with very deep networks, the steep increase in the number of operations can be a significant disadvantage [10]. In addition, the method requires a large number of parameters, which leads to very long training times and high processing costs.
  • In some cases, important features can be lost due to the skip connections used in the encoder-decoder blocks. Different data are obtained when the stride is set to two or more. When the newly obtained data are combined with data of the same size in the decoder structure, a semantically very different situation can arise, even though feature maps of the same dimensions appear to be combined. Many features are lost during this concatenation process, and these lost features can directly harm segmentation success.
  • Data loss can occur due to the gap between the upper and lower layers in the encoder-decoder blocks or due to excessive compression. As a result, very large semantic gaps can occur between the lower and upper layers.
Due to the gaps mentioned above, the development of UNet-based segmentation algorithms that can provide better performance has been the subject of research in the literature [3]. The most popular UNet-based algorithms developed in the literature are the UNet++ and Residual UNet (ResUNet) models. However, these models have some disadvantages. In the UNet++ model, details such as the spatial geometric properties of objects are not very clear. In addition, when deep network structures are created with UNet- and UNet++-based architectures, learning may slow down toward the middle layers. This is because gradients are diluted the farther they are from the network output where the error is calculated, resulting in slower learning for distant weights. There is also the risk of ignoring layers where abstract features are represented [3].
Despite their wide use in the literature today, the memory and time consumption of the UNet++ and ResUNet segmentation models increase significantly with image size. This is a major disadvantage for real-time segmentation of relatively large computed tomography (CT) and ultrasound images [11]. For this reason, UNet-based segmentation algorithms are still being developed in the literature [7]. In this study, a new model is proposed to solve the problems mentioned so far. This model, called CellSegUNet, was developed based on the positive aspects of the UNet++ and ResUNet algorithms.
In the CellSegUNet model, the aim is to prevent data loss that may occur in convolution blocks by using residual modules based on the ResUNet model. Inspired by the node structure used in the UNet++ model, an attention block structure that can be used at each depth level was created. In this attention block, the data at the same depth level and the data in the lower and upper layers are processed simultaneously. Thus, the semantic gap between the lower and upper layers is prevented.
An output layer specific to the designed model was also devised and added to the model. With this output layer, the output data produced at each level of the model can be evaluated, and the output value with the least error constitutes the output of the whole system.
This new design, which contributes to the literature, was tested together with the UNet++ and ResUNet models on three datasets: two cell nucleus datasets and one blood cell dataset. When the results of the experimental study were evaluated, the CellSegUNet model gave better results than the other two models. The main contributions of this article to the literature are briefly summarized as follows:
  • In the CellSegUNet model, semantic gaps that may occur are prevented by collecting both horizontal and vertical features with the attention mechanism.
  • By connecting the convolutional blocks in the residual modules in the CellSegUNet model in serial and parallel, data loss is prevented. Thus, features with a stronger representation ability are obtained.
  • In the CellSegUNet model, unlike the UNet++ model that inspired it, multiple node-like structures are not used between the encoder and decoder at the same level. As a result, a lightweight segmentation model was obtained by reducing the number of operations, and the success level of the model was increased by the reduction in the number of nodes together with the attention block proposed in place of the node.
  • Unlike the UNet-based models that inspired the study, an output layer is available in CellSegUNet. In this layer, the data from each layer of the CellSegUNet model and the data from the input layer are transferred to the difference module and their difference is calculated. The output value obtained from the layer level with the lowest difference forms the output of the whole system. Thus, the system is ensured to produce the output with the least error.
  • When tested at the same depth level as the UNet++ and ResUNet models in the literature on the three datasets described in the article, the proposed model provided higher performance measurements.
The remainder of the article is organized as follows. In the second section, studies in the literature that form the basis of our work on segmentation are presented. The third section provides information about the proposed CellSegUNet, UNet++, and ResUNet models. In the fourth section, the datasets used and the performance criteria commonly used in the literature are described first, followed by information about the experimental study. In the fifth section, the performance results of the experimental study carried out with the CellSegUNet, UNet++, and ResUNet models are shared comparatively. In the last section, the overall success of the study is evaluated and information about future studies is given.

2 Related works

Although cell nucleus segmentation seems like an easy process when performed by expert health personnel, it requires detailed treatment when it is to be performed by computers. When the studies on this subject in the literature are examined, it is seen that the first studies were based on morphological operations [12, 13]. Morphological operation-based methods have some disadvantages. The main disadvantage is that, to determine the boundaries of the cell from the image, many features have to be examined one by one and different operations have to be performed for each feature. Today, such complex operations can be performed with convolutional neural networks and deep learning models [14–21]. When searching the literature for studies based on convolutional neural networks, the fully convolutional network (FCN) method stands out, and good results can be obtained in segmentation using it [22, 23]. However, gradient vanishing problems become more pronounced as FCNs are made deeper, and a deeper network often requires more training examples to reduce overfitting [24]. The UNet segmentation method is the leading method developed to solve these problems [25]. A classical UNet is a segmentation algorithm based on the FCN structure with an encoder-decoder structure added on top. This algorithm has been used as an auxiliary decision support tool in many different biomedical applications, from the detection of different cancer types to blood cell segmentation [26–28].
There are also other segmentation algorithms in the literature, called SegNet, V-Net, and DeConvNet [29]. The encoder-decoder structures used in these algorithms are more sensitive than the classical UNet model. However, the high-resolution pixels of the image are lost in the encoder stage [30]. For this reason, many UNet models with different encoder structures have been developed in the literature [31–33]. The studies on the UNet and ResUNet models, which are also taken as a basis in this study, are summarized below:
Huang et al. achieved vessel segmentation, which is difficult even for expert healthcare professionals, using the 3D-UNet structure they developed. For this, they used the public Sliver07 and 3Dircadb datasets. In their study, they obtained average dice and sensitivity values of 75.3% and 76.7%, respectively [25].
For white blood cell segmentation, Lu et al. developed a new deep learning algorithm (WBC-Net). The algorithm developed consists of a combination of convolution blocks and classical residual blocks. They used different skip paths to combine multi-scale image features with this method. As a result of this process, they obtained better features [34]. This approach has inspired our work as well.
Zhao et al. developed an application to automatically segment and classify lung nodules using CT images. They preferred a patch-based 3D U-Net structure in their studies and achieved a 95.4% success rate [35].
Zhang et al. developed a multiple-supervised residual network (MSRN)-based architecture for osteosarcoma image segmentation in their study. In their practice, they used CT images. They also compared the developed model with the FCN and UNet models. Their developed model obtained 89.22% dice, 88.74% sensitivity, and 0.9305 F1 measure performance results [36].
Kıran et al. developed the DenseRes-UNet model for the segmentation of clustered cell nuclei in multi-organ histopathology images [37]. They proposed a new connection that helps reduce the semantic gap between the encoder and decoder paths and built their model from the connections between this new connection and the encoder-decoder paths.
Ahmed et al. developed a binary attention-based UNet architecture with the ResNet encoder block on histology images [38]. Their developed model is applied in more than 20 cancer regions. They plan to apply the model they developed to different datasets in the future.
Gu et al. developed a new context encoder network (CE-Net) to capture higher-level information and preserve spatial information for 2D medical image segmentation. Their model consists of three main components: the feature encoder module, the context extractor module, and the feature decoder module [39].
Inspired by ConvNeXt, Han et al. developed a new UNet model based on the classical UNet that achieves promising results with few parameters. Their application yielded good segmentation results, and achieving them with few parameters is an important accomplishment [40]. This study was one of the sources of motivation for our work.
Huang et al. developed the UNet 3+ architecture based on the UNet++ model, leveraging full-scale skip connections and deep supervision. In their application, they first identified the regions of the image with and without organs. Afterward, they performed the segmentation with UNet 3+. With their model, they prevented segmentation in non-organ regions [25].
Li et al. developed the Residual-Attention UNet++ model in their study. They tested the model on three medical image datasets: skin cancer, cell nucleus, and coronary artery angiography. With the model they developed, they suppressed the background area irrelevant to the segmentation task and increased the weight of the target area, achieving higher segmentation success compared to other studies in the literature. Although their model has limited similarities at a basic level with the logic proposed here, the most important difference from the proposed model is the output layer; the architectural structure of the proposed model is also quite different [41].
Haniabadi et al. performed a comprehensive comparative analysis between traditional and deep learning-based approaches for medical image segmentation. In their analysis, they evaluated traditional methods in four categories: intensity-based, boundary-based, model-based, and region-based methods. They considered the simplicity and computational efficiency of traditional methods an advantage but pointed out that these methods are limited in terms of accuracy and robustness, and that their need for manual intervention is a significant disadvantage. Moreover, they note that UNet, VNet, and CNN-based deep learning approaches are promising. The model proposed in this study, based on the UNet and Residual UNet models, is in line with Haniabadi et al.'s views [42].
In their study, Lan et al. developed a method to solve the problems caused by low contrast, large differences in cell morphology, and the scarcity of labeled images. They developed a UNet++-based model to obtain multi-scale features, using atrous spatial pyramid pooling and a multi-sided output fusion (MSOF) strategy. They measured the performance of their work experimentally only on the BCISC dataset. They state that the dice and jaccard values obtained in their experiments compare favorably with other studies in the literature [43].
As can be seen from the literature review, new methods with different encoder-decoder structures and expandable depth have been developed to obtain better results than existing segmentation models. In this study, inspired by the points mentioned above, the aim is to develop a model with a new encoder-decoder structure that can achieve high segmentation success.

3 Methods

In this study, the deep learning-based UNet++ and ResUNet segmentation models were used. These models were run with similar parameters so that the advantages and disadvantages of both models could be examined under equal conditions. Based on these investigations, a new architecture was designed around the advantageous points of the two models. An output layer, which is not present in the other models, was also added to the designed architecture. This proposed architecture was named CellSegUNet. Thus, three different models were applied in total in the study, together with the new CellSegUNet model.

3.1 UNet++ 

The UNet++ model is a segmentation algorithm based on the UNet segmentation model. UNet is a segmentation algorithm that was initially used in medical image processing applications and has been preferred in different image segmentation applications such as remote sensing in recent years [9]. UNet is a deep learning model based on encoder-decoder structure.
In this model, as shown in the decoder block in Fig. 1a, only feature maps of the same size are combined while the information in \(X_{En}^{2.0}\) is transferred to \(X_{De}^{2.0}\); the same holds for the other nodes. For this reason, uniform information about the segmentation region is presented, and as a result, the model cannot precisely define the location and boundaries of the learning targets. The UNet++ segmentation algorithm was developed to solve this problem.
While UNet connects the data of the same scale in the encoder and decoder blocks, in the UNet++ model the data from the intermediate nodes are combined using nested dense skip connections. The data from different nodes in the merge process are on the same scale. Eqs. (1)–(4) describe the merging process, summarized in Fig. 1b.
$$X_{De}^{1.4} = X_{Me}^{1.3} + X_{Me}^{1.2} + X_{Me}^{1.1} + X_{En}^{1.0} + X_{De}^{2.3}$$
(1)
$$X_{Me}^{1.3} = X_{Me}^{1.2} + X_{Me}^{1.1} + X_{En}^{1.0} + X_{Me}^{2.2}$$
(2)
$$X_{Me}^{1.2} = X_{Me}^{1.1} + X_{En}^{1.0} + X_{Me}^{2.1}$$
(3)
$$X_{Me}^{1.1} = X_{En}^{1.0} + X_{En}^{2.0}$$
(4)
The sub-feature maps obtained after feature extraction in the encoder layer presented in Fig. 1b are expressed as \(X_{En}^{1.0}\), \(X_{En}^{2.0}\), \(X_{En}^{3.0}\), \(X_{En}^{4.0}\), and \(X_{En}^{5.0}\). At each layer, these feature maps are passed through the pooling function to produce the feature maps of the next lower layer. Repeating this process finally yields the \(X_{En}^{5.0}\) feature map shown in Fig. 1b.
Feature maps at different levels of the UNet++ model contain different detailed information for the same image. As with \(X_{De}^{1.4}\), the shallower the network, the smaller the receptive field, but the stronger the geometric detail representation ability of the extracted shallow feature information, and the better it can capture the spatial information of the object, such as contours and edges. Conversely, the deeper the network, the larger the receptive field that determines the object features; in this case, details of the objects, such as spatial geometric features, are not obvious, but the feature information has a strong semantic representation ability [6].
In this article, the UNet++ model is tested to fully extract the multi-scale feature information of cell structures. Shallow feature information and deep feature information are combined as in Eqs. (1)–(4).
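To make the nested wiring of Eqs. (1)–(4) concrete, the sketch below builds one UNet++ merge node in TensorFlow/Keras. It is a minimal illustration under assumptions: the conv_block helper, filter counts, and tensor shapes are hypothetical, not the paper's published code.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # Two 3x3 convolutions, each followed by batch normalization and ReLU
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    return x

def merge_node(same_scale_maps, lower_map, filters):
    # Nested dense skip connection: upsample the lower-level map and
    # concatenate it with every same-scale map produced so far
    up = layers.UpSampling2D(2)(lower_map)
    merged = layers.Concatenate()([*same_scale_maps, up])
    return conv_block(merged, filters)

# Eq. (4): X_Me^{1.1} merges X_En^{1.0} with the upsampled X_En^{2.0}
x_en10 = tf.random.normal((1, 64, 64, 32))  # scale-1 encoder output
x_en20 = tf.random.normal((1, 32, 32, 64))  # scale-2 encoder output
x_me11 = merge_node([x_en10], x_en20, 32)   # shape (1, 64, 64, 32)
```

Deeper nodes such as \(X_{Me}^{1.2}\) and \(X_{Me}^{1.3}\) follow the same pattern, each time passing the growing list of same-scale maps.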

3.2 ResUNet

ResUNet is a UNet-based segmentation model built on the residual architecture. It was developed to overcome the difficulties arising from training deep neural networks. One of these difficulties is that deep learning-based segmentation models consist of many layers, and a large number of layers usually forces the system to memorize, decreasing performance. Another cause of decreased performance is vanishing gradients in the weight matrices, which generally occur as features are lost in deeper neural networks. The Residual UNet architecture aims to solve these problems [44]. A residual structure uses skip connections that take the feature map from one layer and connect it to another, deeper layer, as shown in detail in Fig. 2a. This structure better preserves the network's features in deep neural networks and thereby increases performance [45].
Three sequential ResNet blocks with skip connections are shown in Fig. 2b. In each block, the first convolution operation is applied to the block input, and the input is added to the output of the second convolutional layer using a skip connection. The triple block structure shown in Fig. 2b is the ResNet block structure that has been used in the literature in recent years [46].
The skip connections shown in Fig. 2 are implemented in UNet encoder-decoder structures. The use of skip connections in the ResUNet structure also helps reduce the gradient vanishing problem. As a result, it allows the design of UNet models with deeper neural networks. Each residual unit is represented by Eqs. (5)–(6).
$$y_{i} = h\left( {x_{i} } \right) + {\mathcal{F}}\left( {x_{i} ,w_{i} } \right)$$
(5)
$$x_{i + 1} = f\left( {y_{i} } \right)$$
(6)
In Eqs. (5) and (6), \(x_{i}\) and \(x_{i + 1}\) denote the input and output of the residual unit, and \({\mathcal{F}}\), \(f\), and \(h\) represent the residual, activation, and identity mapping functions, respectively. The ResUNet structure preferred in this study has been used in many applications, from nuclei segmentation to retinal vessel segmentation, and remains a segmentation algorithm of interest in the literature [47–50].
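As a reading aid, the following minimal Keras sketch expresses Eqs. (5)–(6) directly. The 1 × 1 projection used when channel counts differ is an assumption; the section does not specify how the identity mapping \(h\) handles that case.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_unit(x, filters):
    # F(x_i, w_i): two convolution stages with batch normalization
    r = layers.Conv2D(filters, 3, padding="same")(x)
    r = layers.BatchNormalization()(r)
    r = layers.ReLU()(r)
    r = layers.Conv2D(filters, 3, padding="same")(r)
    r = layers.BatchNormalization()(r)
    # h(x_i): identity mapping, projected by a 1x1 conv if shapes differ
    h = x if x.shape[-1] == filters else layers.Conv2D(filters, 1)(x)
    y = layers.Add()([h, r])   # Eq. (5): y_i = h(x_i) + F(x_i, w_i)
    return layers.ReLU()(y)    # Eq. (6): x_{i+1} = f(y_i)
```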
The ResUNet model used in the study is presented in Fig. 3. The content of the ResBlock used in this model is presented in Fig. 3b, and the content of the pooling layer in Fig. 3c. Since a depth level of four is used in the UNet++ model, the depth level of the ResUNet model presented in Fig. 3 was also set to four.

3.3 Proposed CellSegUNet

In this section, the CellSegUNet method developed based on deep learning-based segmentation algorithms for cell nuclei segmentation is presented in general. The CellSegUNet model consists of the encoder, decoder, residual module, attention module, multiply module, difference module, min pooling layer, and output layer.
The encoder and decoder blocks of the CellSegUNet model are similar to classical UNet architectures, but their internal structure is quite different. Chief among these differences is the use of residual structures connected both serially and in parallel within the blocks. As seen in the literature, residual blocks can be connected very tightly or sparsely [11, 51, 52]. Both extremes can cause problems in terms of system performance: a tight residual connection greatly increases the computational and memory cost, depending on the size of the image, whereas semantic gaps may occur if connections are sparse. Therefore, unlike the literature, residual connections were kept at an optimum level in this study.
UNet++ models were also tested on the datasets used in the study. The effect of the node structures in the UNet++ structure on the success was also examined in the study. As a result of this examination, it has been seen that the high number of nodes in the same layer can negatively affect both the success and the process cost. For this reason, an attention module has been proposed in the study, similar to the node structure used in UNet++ , but with different content.
The proposed attention module has three inputs and one output, as shown in Fig. 4. Two of these inputs are inputs 1 and 2 in Fig. 4, obtained from the input and output of the same-level B_BlockEncoder, as expressed by Eqs. (8)–(9). The input of the B_BlockEncoder is obtained by downsampling the output of the A_BlockEncoder one level above it, as in Eq. (7).
$${\text{A}}\_{\text{BlockEncoder}}_{{i_{{{\text{next}}}} }}^{{{\text{out}}}} = {\text{ DownSampling}} \left( {{\text{A}}\_{\text{BlockEncoder}}_{i}^{{{\text{out}}}} } \right)$$
(7)
$${\text{Attention}}\_{\text{Module}}_{1}^{{{\text{in}}}} = {\text{B}}\_{\text{BlockEncoder}}_{i}^{{{\text{in}}}} = {\text{A}}\_{\text{BlockEncoder}}_{{i_{{{\text{next}}}} }}^{{{\text{out}}}}$$
(8)
$${\text{Attention}}\_{\text{Module}}_{2}^{{{\text{in}}}} = {\text{B}}\_{\text{BlockEncoder}}_{i}^{{{\text{out}}}}$$
(9)
Another input of the attention module is input number 3 in Fig. 4, expressed by Eqs. (10)–(11). This input is obtained from the output of the C_BlockEncoder located one level below, indicated by the number 5. However, since the dimensions of the obtained matrix are smaller, it is first transferred to the multiply module to increase its size and then passed as input to the attention module. The output of the attention module is calculated by Eq. (12) and is sent both to the same-level D_decoder and directly to the output layer.
$${\text{Multiply}}\_{\text{Module}}_{i}^{{{\text{out}}}} = {\text{Multiply}}\_{\text{Module}}_{i}^{{{\text{in}}}} \left( {{{C}}\_{\text{BlockEncoder}}_{i}^{{{\text{out}}}} } \right)$$
(10)
$${\text{Attention}}\_{\text{Module}}_{3}^{{{\text{in}}}} = {\text{Multiply}}\_{\text{Module}}_{i}^{{{\text{out}}}}$$
(11)
$${\text{Attention}}\_{\text{Module}}_{1}^{{{\text{Out}}}} = {\text{Concat}}\left( {{\text{Attention}}\_{\text{Module}}_{1}^{{{\text{in}}}} ,{\text{ Attention}}\_{\text{Module}}_{2}^{{{\text{in}}}} ,{\text{ Attention}}\_{\text{Module}}_{3}^{{{\text{in}}}} } \right)$$
(12)
The point to be noted here is that the attention module simultaneously processes data from the level above it (number 1), from its own level (number 2), and from the level below it (number 3). This design is proposed as an effective solution that both minimizes the semantic gap encountered in UNet++ models and reduces the processing cost.
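The data flow of Eqs. (7)–(12) can be sketched as follows. The 2× upsampling factor inside the multiply module and the absence of a post-concatenation convolution are assumptions made only for illustration.

```python
from tensorflow.keras import layers

def attention_module(block_in, block_out, lower_out):
    # Input 3 (Eqs. (10)-(11)): enlarge the lower-level output so that
    # all three tensors share the same spatial size ("multiply module")
    enlarged = layers.UpSampling2D(2)(lower_out)
    # Eq. (12): concatenate inputs 1, 2, and 3 into the module output
    return layers.Concatenate()([block_in, block_out, enlarged])
```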
In addition, unlike the literature, an output layer has been added to the CellSegUNet model, as shown in Fig. 5, to evaluate the outputs obtained from each level. The dimensions of the image transferred from the input level and the output values obtained from each level differ from each other. For this reason, upsampling is applied to the data obtained from the lower levels. The difference between the expanded images and the input image is calculated in the \({\text{Dif}}_{N}\) blocks. The calculated difference value and the image it belongs to are transferred to the \(F_{N}\) block. This is expressed by Eq. (13).
$$F_{N} = {\text{Dif}}_{N} = \left\{ {\begin{array}{*{20}l} {{\text{Different}}\left( {{\text{Input}}_{{{\text{original}}}} ,{\text{UpSample}}_{1/n} } \right),\;{\text{or}}} \hfill \\ {{\text{Different}}\left( {{\text{Input}}_{{{\text{original}}}} ,{\text{BlockEn}}_{1} } \right),\;{\text{or}}} \hfill \\ {{\text{Different}}\left( {{\text{Input}}_{{{\text{original}}}} ,{\text{BlockDe}}_{1} } \right)} \hfill \\ \end{array} } \right\},\;\;\;\;\begin{array}{*{20}c} {N = 1, \ldots ,9} \\ {n = 2,4,8,16} \\ \end{array}$$
(13)
Feature selection is performed on the \(F_{N}\) blocks in the min pooling layer, both to increase the performance of the proposed model and to reduce the overall computational cost. The feature used here is the difference value between the result obtained from each level and the image presented at the input of the system. In this way, a feature map is created from the \(F_{N}\) data pairs. The lowest difference value among these is determined by min pooling. The output of the level that provides the lowest difference value constitutes the output of the whole system. This is expressed by Eq. (14).
$${\text{Out}} = {\text{MinPooling}}\left( {F_{N} } \right), \;\;\;\;N = 1, \ldots ,9$$
(14)
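A minimal sketch of this selection logic is given below, assuming the mean absolute difference as the \({\text{Dif}}_{N}\) score; the text states only that a difference from the input image is computed.

```python
import tensorflow as tf

def select_output(input_image, level_outputs):
    # level_outputs: per-level predictions already upsampled to input size
    diffs = [tf.reduce_mean(tf.abs(input_image - out))   # F_N, Eq. (13)
             for out in level_outputs]
    best = int(tf.argmin(tf.stack(diffs)))               # min pooling, Eq. (14)
    return level_outputs[best]                           # system output
```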
In the remainder of this section, the general working principle of the proposed model is presented in detail. The steps of the proposed model are illustrated on the SCIS dataset to avoid repetition. The images in the SCIS dataset are 704 × 520 pixels. Since matrix operations are performed consecutively in the proposed structure, the images were converted to 512 × 512 for ease of processing. These images are first transferred to the top layer of the encoder structure presented in Fig. 6a.
At the top layer, there is the Block_En_1 residual module. This residual module consists of convolution blocks in itself. Convolution, batch normalization, and ReLU operations are performed in the convolution block used in this structure, respectively. In the convolution operation, the two-dimensional convolution of the image and filter matrices is calculated. This process is done with Eq. (15).
$${\text{Conv}}I\left( {x,y} \right) = f*I\left( {x,y} \right) = \mathop \sum \limits_{{{\text{d}}x = - a}}^{a} \mathop \sum \limits_{{{\text{d}}y = - b}}^{b} f\left( {{\text{d}}x,{\text{d}}y} \right) I\left( {x - {\text{d}}x,y - {\text{d}}y} \right)$$
(15)
\({\text{Conv}}I\left( {x,y} \right)\), \(I\left( {x,y} \right)\), and \(f\) which are the symbols defined in Eq. (15), represent the filtered image, the original image, and the filter kernel, respectively. The output of the convolution process is transferred to the batch normalization process as input. The batch normalization value was calculated using the formulas in Eqs. (16)–(19).
$$\mu_{B} = \frac{1}{M}\mathop \sum \limits_{m = 1}^{M} {\text{Conv}}I_{m}$$
(16)
$$\sigma_{B}^{2} = \frac{1}{M}\mathop \sum \limits_{m = 1}^{M} \left( {{\text{Conv}}I_{m} - \mu_{B} } \right)^{2}$$
(17)
$${\text{Conv}}I_{m}^{\prime} = \frac{{{\text{Conv}}I_{m} - \mu_{B} }}{{\sqrt {\sigma_{B}^{2} + \varepsilon } }}$$
(18)
$$y_{m} = \gamma {\text{ConvI}}_{m}^{\prime} + \beta$$
(19)
The batch normalization (BN) result of a \({\text{Conv}}I_{m}\) input defined in the range m ∈ [1, 2, 3, …, M] is the \(y_{m}\) value. In the study, this M value was used as 64. The terms \(\mu_{B}\), \(\sigma_{B}^{2}\), \({\text{Conv}}I_{m}^{\prime}\), and \(y_{m}\) defined in Eqs. (16)–(19) represent the mini-batch mean, the mini-batch variance, the normalized value of \({\text{Conv}}I\), and the calculated output value, respectively. The \(\varepsilon\) value defined in Eq. (18) is a small positive number used to avoid division by zero. The \(\gamma\) and \(\beta\) parameters are learned via backpropagation. The output of batch normalization is passed as input to the ReLU function, which is computed with Eq. (20) [53].
$${\text{ReLU}}\left( {y_{m} } \right) = \left\{ {\begin{array}{*{20}l} {y_{m} } \hfill & {{\text{if}}\;y_{m} > 0} \hfill \\ 0 \hfill & {{\text{otherwise}}} \hfill \\ \end{array} } \right.$$
(20)
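Taken together, Eqs. (15)–(20) describe one convolution block, sketched below; the 3 × 3 kernel size is an assumption, since the exact filter dimensions are not stated.

```python
from tensorflow.keras import layers

def convolution_block(x, filters=64):
    x = layers.Conv2D(filters, 3, padding="same")(x)  # convolution, Eq. (15)
    x = layers.BatchNormalization(epsilon=1e-5)(x)    # BN, Eqs. (16)-(19)
    return layers.ReLU()(x)                           # ReLU, Eq. (20)
```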
In the first step, the convolution block is applied twice in succession, as shown in Fig. 6a. Then, the input image data and the data obtained at the end of the convolution blocks are combined, and the result is directed to the second step. The mathematical operations performed in the first step are presented in Eq. (21), where the FS expression refers to the data obtained as a result of the first step.
$$\left( {\begin{array}{*{20}l} {{\text{Conv}}I_{1} \left( {x,y} \right)_{i} = {\text{ReLU}}\left( {{\text{BN }}\left( {f_{i} *I\left( {x,y} \right)} \right)} \right)} \hfill \\ {{\text{Conv}}I_{1} \left( {x,y} \right)_{i + 1} = {\text{ ReLU}}\left( {{\text{BN}} \left( {f_{i + 1} *{\text{Conv}}I_{1} \left( {x,y} \right)_{i} } \right)} \right)} \hfill \\ {{\text{FS}}\left( {x,y} \right)_{i} = {\text{Concat}}\left( { {\text{Conv}}I_{1} \left( {x,y} \right)_{i} ,{\text{Conv}}I_{1} \left( {x,y} \right)_{i + 1} } \right)} \hfill \\ \end{array} } \right) ,\;\;\;\;{\text{First Step}} \left( {{\text{FS}}} \right)$$
(21)
In the second step, the image is transferred from two different channels to the third and fourth steps. Within these steps, convolution, batch normalization, and ReLU operations are repeated twice, one after the other. Within these steps, the merge operation is performed as in the first block. The processes in the third and fourth steps are presented in Eqs. (22) and (23).
$$\left( {\begin{array}{*{20}l} {{\text{Conv}}I_{2} \left( {x,y} \right)_{i} = {\text{ReLU}}\left( {{\text{BN}}\left( {f_{tr\_i} *{\text{FS}}\left( {x,y} \right)_{i} } \right)} \right)} \hfill \\ {{\text{Conv}}I_{2} \left( {x,y} \right)_{i + 1} = {\text{ReLU}}\left( {{\text{BN}} \left( {f_{tr\_i + 1} *{\text{ Conv}}I_{2} \left( {x,y} \right)_{i} } \right)} \right)} \hfill \\ {{\text{TR}}\left( {x,y} \right)_{i} = {\text{Concat}}\left( { {\text{Conv}}I_{2} \left( {x,y} \right)_{i} ,{\text{ Conv}}I_{2} \left( {x,y} \right)_{i + 1} } \right)} \hfill \\ \end{array} } \right) ,\;\;\;\;{\text{Third}}\;{\text{Step}} \left( {{\text{TR}}} \right)$$
(22)
$$\left( {\begin{array}{*{20}l} {{\text{Conv}}I_{3} \left( {x,y} \right)_{i} = {\text{ReLU}}\left( {{\text{BN}} \left( {f_{fr\_i} *{\text{TR}}\left( {x,y} \right)_{i} } \right)} \right)} \hfill \\ {{\text{Conv}}I_{3} \left( {x,y} \right)_{i + 1} = {\text{ ReLU}}\left( {{\text{BN}} \left( {f_{fr\_i + 1} *{\text{ Conv}}I_{3} \left( {x,y} \right)_{i} } \right)} \right)} \hfill \\ {{\text{FR}}\left( {x,y} \right)_{i} = {\text{Concat}}\left( {{\text{ Conv}}I_{3} \left( {x,y} \right)_{i} , {\text{Conv}}I_{3} \left( {x,y} \right)_{i + 1} } \right)} \hfill \\ \end{array} } \right) ,\;\;\;\;{\text{Fourth}}\;{\text{Step}} \left( {{\text{FR}}} \right)$$
(23)
The TR term used in Eqs. (22) and (23) refers to the data obtained from the third step, and the FR term refers to the data obtained from the fourth step. The variables \(f_{tr\_i}\) and \(f_{fr\_i}\) denote the kernel filters used in the third and fourth steps. The results of the third and fourth steps are combined in the fifth step (FV), as expressed by Eq. (24).
$${\text{FV}} = {\text{Concat}}\left( {{\text{FR}}\left( {x,y} \right)_{i} ,{\text{TR}}\left( {x,y} \right)_{i} } \right),\;\;\;\;{\text{Fifth}}\;{\text{Step}} \left( {{\text{FV}}} \right)$$
(24)
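Following Eqs. (21)–(24) literally (the equations chain FR after TR, while the prose describes the third and fourth steps as two parallel channels), the residual module can be sketched as below, reusing the convolution_block helper from the previous sketch; filter counts are assumptions.

```python
from tensorflow.keras import layers

def residual_module(x, filters):
    # First step (FS), Eq. (21): two serial conv blocks, then concat
    c1 = convolution_block(x, filters)
    c2 = convolution_block(c1, filters)
    fs = layers.Concatenate()([c1, c2])
    # Third step (TR), Eq. (22): two serial conv blocks on FS, then concat
    t1 = convolution_block(fs, filters)
    t2 = convolution_block(t1, filters)
    tr = layers.Concatenate()([t1, t2])
    # Fourth step (FR), Eq. (23): two serial conv blocks on TR, then concat
    f1 = convolution_block(tr, filters)
    f2 = convolution_block(f1, filters)
    fr = layers.Concatenate()([f1, f2])
    # Fifth step (FV), Eq. (24): merge the outputs of the two branches
    return layers.Concatenate()([fr, tr])
```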
These results are transferred both to the down sampling process in a lower layer and to the attention block at the level it is located. In the down sample operation, the Max-Pooling (1 × 2) operation is performed first horizontally and then vertically.
The process performed up to this step is also a kind of feature extraction, because new features are obtained from the data with kernel filters. Feature selection must then be made from the obtained features. At this stage, the Max-Pooling method, which is widely preferred and applied in the literature, was used [54, 55]. In the Max-Pooling method, the maximum value of the pooling region is selected so that it is not lost. For this reason, the Max-Pooling operation presented in Eq. (25) was used to subsample the feature maps in the study. At this stage, 1 × 2 Max-Pooling was applied to one-dimensional slices, so that a feature vector of size n × 1 consisting of a single line is reduced from n × 1 to (n/2) × 1.
$$f\left( R \right) = \mathop {\max }\limits_{x \in R} x$$
(25)
$$f\left( R \right) = \mathop {\min }\limits_{x \in R} x$$
(26)
$$f\left( R \right) = \frac{1}{\left| R \right|}\mathop \sum \limits_{x \in R} x$$
(27)
Here, \(R\) denotes the pooling region.
Apart from Max-Pooling, the Min-Pooling and Mean-Pooling methods used in the literature and presented in Eqs. (26) and (27) were also tried. However, the results obtained from these methods reduced the system's performance. For this reason, the Max-Pooling method was used, in line with the literature [54]. After Max-Pooling, the data are transferred to Block_En_2. Unlike Block_En_1, the output of Block_En_2 is transmitted to three different structures, presented in detail in Fig. 6. Similar operations are performed in Block_En_2, and the result is then transferred to the sublayer Block_En_3. The same operations are repeated up to the Block_En_5 structure. As shown in Fig. 6b, when the Block_En_5 structure is reached, the operations in Fig. 6a are repeated four times in a row within the blocks. The result obtained is transferred to the decoder structure by applying the upsample process.
The decoder structure has a layered architecture, as presented in Fig. 6b. In this architecture, the values coming from the lower layer are concatenated with the data coming from the attention structure at the same level. Then, the resulting data are transferred to the next layer by applying the upsample process. The same operations are repeated up to the Block_De_1 structure. The sizes of all outputs produced by the system up to the Block_De_1 structure are presented in Table 1. After this process, the values obtained from each block level are transferred to the output layer. For this reason, data expansion is applied to matrices with small output sizes. The outputs obtained in the output layer, their input images, and the differences between them are subjected to feature selection by the min pooling method. The output with the lowest error value constitutes the output of the entire system. Adding an output layer to the end of the proposed CellSegUNet model in this way, and determining the output by selecting features with the min pooling method in this layer, has significantly improved segmentation success.
Table 1
Encoder–decoder output sizes

Encoder             | Output sizes    | Decoder             | Output sizes
Input               | 512 × 512 × 1   | Residual block De 4 | 64 × 64 × 256
Residual block En 1 | 512 × 512 × 32  | Up convolution      | 128 × 128 × 256
Pooling             | 256 × 256 × 32  | Residual block De 3 | 128 × 128 × 128
Residual block En 2 | 256 × 256 × 64  | Up convolution      | 256 × 256 × 128
Pooling             | 128 × 128 × 64  | Residual block De 2 | 256 × 256 × 64
Residual block En 3 | 128 × 128 × 128 | Up convolution      | 512 × 512 × 64
Pooling             | 64 × 64 × 128   | Residual block De 1 | 512 × 512 × 1
Residual block En 4 | 64 × 64 × 256   |                     |
Pooling             | 32 × 32 × 256   |                     |
Residual block En 5 | 32 × 32 × 512   |                     |
Up convolution      | 64 × 64 × 512   |                     |

4 Experimental study

4.1 Datasets

Three different datasets were used in the study to examine the behavior of the models on different data. One of these datasets has been used in many different studies in the literature; the other two are quite new and promising datasets. The first dataset used in the study is the Sartorius Cell Instance Segmentation (SCIS) dataset. This dataset is a derived version of the LIVECell dataset and is one of the promising datasets. The LIVECell dataset contains 8 different cell types, as shown in Table 2, and a total of 5239 images. These images are annotated with 1,686,353 separate cell locations [56].
Table 2
Distribution of cell types in the LIVECell dataset

Cell type | Count | Cell type | Count
Shsy5y    | 704   | Bv2       | 608
Mcf7      | 735   | A172      | 608
Bt474     | 672   | Huh7      | 600
Skbr3     | 704   | Skov3     | 608
Different types of cells vary in size, density, and shape. In particular, Shsy5y, with its long protrusions and overlapping cells, has the lowest segmentation success. Therefore, the SCIS dataset, which focuses on these challenges, was created. There are three different types of neuronal cells in the SCIS dataset, namely cortical neurons (Cort), Shsy5y, and astrocytes (Astro). Statistical properties of these cells are presented in Table 3 [57]. The dataset contains 606 sample images, including 320 Cort, 155 Shsy5y, and 131 Astro samples.
Table 3
Distribution of cell types in the SCIS dataset

Type   | #Samples | Annotation count (Mean / Min / Max) | Mask area, # pixels (Mean / Min / Max)
Astro  | 131      | 80 / 5 / 594                        | 906 / 37 / 13,327
Cort   | 320      | 34 / 4 / 108                        | 240 / 33 / 2,054
Shsy5y | 155      | 337 / 49 / 790                      | 224 / 30 / 2,254
The second dataset used in the study is the Data Science Bowl (DSB) dataset [58]. This dataset is one of the popular datasets used in many studies in the literature. It consists of images of cell nuclei collected by specialist personnel in various hospitals and universities; the cell nuclei were obtained from mouse, fly, and human samples. A total of 735 images from this dataset were used in the study. The sizes and numbers of color channels of the images differ between the datasets: the images in the SCIS dataset have one channel, while the images in the DSB dataset have three channels. For this reason, the images in the DSB dataset were reduced to one channel using Eq. (28).
$$I = \frac{R + G + B}{3}$$
(28)
Used in Eq. (28), R represents the red color channel, G represents the green color channel, and B represents the blue color channel. The \(I\) value is used in Eq. (29) to convert the image to binary format.
$$I\left( {x,y} \right) = \left\{ {\begin{array}{*{20}l} {1\;\;\;\;{\text{if}}\;I\left( {x,y} \right) > T} \hfill \\ {0\;\;\;\;{\text{otherwise}}} \hfill \\ \end{array} } \right.$$
(29)
The \(T\) value used in Eq. (29) represents the threshold; 0.5 was used as the threshold value in the study. The \(x\) and \(y\) pair represents the coordinates of a value in the image. The images in the three datasets used in the study are given as input to the segmentation models after being converted to binary format.
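A minimal NumPy sketch of this preprocessing, assuming images already scaled to [0, 1]:

```python
import numpy as np

def to_binary(rgb_image, threshold=0.5):
    gray = rgb_image.mean(axis=-1)               # Eq. (28): I = (R+G+B)/3
    return (gray > threshold).astype(np.uint8)   # Eq. (29): threshold at T
```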
The third dataset used in the study is the Blood-Cell-Segmentation (BCS) dataset [59]. This dataset is a public dataset created from images of microscopic blood cells. It is a dataset introduced to the literature by Depto et al. in 2021 and is also called BBBC041Seg dataset in the literature. It is one of the current datasets in the literature. It contains 1328 blood cell images and 1328 masks belonging to these images. The dimensions of all images in the dataset are 1600 × 1200 pixels. Images of blood cells are in RGB format, and the masks are in binary format.

4.2 Application of experiments

Experimental studies within the scope of this article were carried out on the Windows 10 operating system with a 3.2 GHz processor and an NVIDIA RTX 3060 GPU-supported graphics card. The models were developed in Python 3.9.7 with the TensorFlow 2.5.0 library. The general flowchart of the study is presented in Fig. 7.
As seen in Fig. 7, data processing was performed in the first step of the experimental study. In this step, first, the images in three datasets were read separately. Data augmentation was performed on the read data. In the data augmentation process, magnification, zooming, left-to-right rotation, and horizontal rotation were performed. As a result, 3030 images were created in SCIS, the first dataset, and 3675 images were created in the second dataset, DSB. The number of images in the third dataset, the BCS dataset, was increased to 13,280.
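These operations can be sketched with Keras' ImageDataGenerator; all parameter values below are assumptions, as the augmentation settings are not listed.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Hypothetical augmentation settings covering the stated operations
augmenter = ImageDataGenerator(
    zoom_range=0.2,         # magnification / zooming
    rotation_range=15,      # rotation
    horizontal_flip=True)   # left-to-right flipping
```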
Afterward, the images and their masks in the SCIS dataset were resized to 512 × 512 × 1, and the images and masks in the DSB and BCS datasets were resized to 128 × 128 × 1. Multiple cell nuclei can be found in the images, so each image dataset was evaluated separately. For example, in the SCIS dataset, the masks carry annotation information, and the mask of each image is created by combining its annotations. In the DSB dataset, there are separate masks for each image, and these masks were combined in the application.
In the second step of the flowchart, the models to be used in the experimental study were selected for each dataset. At this step, the CellSegUNet model proposed in the study and the ResUNet and UNet++ models in the literature were used.
In the third step, the images in the three datasets were divided into 80% training and 20% testing groups according to fivefold cross-validation. All three models used in the study were trained separately with the determined parameters. After the training process, the system produced results on the test data.
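The split can be sketched with scikit-learn's KFold, an assumed tool not named in the text:

```python
import numpy as np
from sklearn.model_selection import KFold

sample_ids = np.arange(3030)  # e.g., augmented SCIS image indices
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(sample_ids):
    # each fold: 80% of the images for training, 20% for testing
    train_ids, test_ids = sample_ids[train_idx], sample_ids[test_idx]
```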
In the fourth step, which is the last step, the outputs of all models used in the study were analyzed by using the dice, accuracy, and jaccard indices commonly used in the literature [34, 60, 61].
1. Dice index (Dice): The dice formula is one of the measures in the literature of the similarity between the ground-truth pixels and the segmented cell nuclei. The content of this formula is presented in Eq. (30) [60].
$${\text{Dice}} = \frac{{2{\text{TP}}}}{{2{\text{TP}} + {\text{FP}} + {\text{FN}}}}$$
(30)
2. Accuracy: The ratio of the number of correctly segmented pixels to the total number of pixels is expressed by accuracy [61]. The accuracy formula is presented in Eq. (31).
$${\text{Accuracy}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}}$$
(31)
3. Jaccard index (Jaccard): The ratio of correctly detected cell nucleus pixels to the sum of correctly and incorrectly detected pixels is called the jaccard index. The jaccard index formula is presented in Eq. (32).
$${\text{Jaccard}} = \frac{{{\text{TP}}}}{{{\text{TP}} + {\text{FN}} + {\text{FP}}}}$$
(32)
The terms TP, TN, FP, and FN used in Eqs. (30)–(32) are defined as follows. TP (true positive) is the number of pixels correctly segmented as cell nuclei that are actually cell nuclei. TN (true negative) is the number of pixels correctly segmented as background that are actually background. FP (false positive) is the number of actual background pixels predicted as cell nuclei. FN (false negative) is the number of actual cell nucleus pixels predicted as background.
The performance success of each of the test images used in each dataset was measured individually. Then, the average accuracy, jaccard, and dice values of the dataset were calculated.
$${\text{Accuracy}}_{{{\text{mean}}}} = \frac{{\sum\limits_{i = 1}^{i = n} {{\text{Accuracy}}_{i} } }}{n}$$
(33)
$${\text{Jaccard}}_{{{\text{mean}}}} = \frac{{\sum\limits_{i = 1}^{i = n} {{\text{Jaccard}}_{i} } }}{n}$$
(34)
$${\text{Dice}}_{{{\text{mean}}}} = \frac{{\sum\limits_{i = 1}^{i = n} {Dice_{i} } }}{n}$$
(35)
The n value used in Eqs. (33)–(35) represents the number of test images. \({\text{Accuracy}}_{i}\), \({\text{Jaccard}}_{i}\), and \({\text{Dice}}_{i}\) are the values obtained from image i.
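For binary masks, the per-image metrics of Eqs. (30)–(32) and the dataset means of Eqs. (33)–(35) can be sketched in NumPy as follows:

```python
import numpy as np

def segmentation_metrics(pred, truth):
    tp = np.sum((pred == 1) & (truth == 1))
    tn = np.sum((pred == 0) & (truth == 0))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    dice = 2 * tp / (2 * tp + fp + fn)          # Eq. (30)
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # Eq. (31)
    jaccard = tp / (tp + fp + fn)               # Eq. (32)
    return accuracy, dice, jaccard

# Eqs. (33)-(35): average each metric over the n test images
# means = np.mean([segmentation_metrics(p, t)
#                  for p, t in zip(preds, truths)], axis=0)
```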

5 Results and discussion

In this section, the application results that measure the segmentation success of the proposed CellSegUNet, UNet++, and ResUNet models are explained. In each application, the three datasets described in Sect. 4 were used; the general flowchart of the system is shared in Fig. 7. The proposed CellSegUNet, UNet++, and ResUNet models were first trained on the SCIS dataset for 100 and 125 epochs. The training results obtained are shown in Fig. 8. Afterward, the system was trained for 150 and 175 epochs with the same dataset and models. The training results for these epochs are shared in Fig. 9. As can be seen from both graphs, increasing the number of epochs does not have a strongly positive effect on the success of the system. The most striking point, however, is the results of the UNet++ model, which produced the lowest success results from the first epochs onward. A critical decision was made at this stage: basic parameters such as the optimization function, batch size, and loss function were kept at the same values in the UNet++ and ResUNet models, and no model-specific interventions were made to increase success. Thus, the results of both models could be evaluated under similar conditions. The same parameters were used in the proposed CellSegUNet model.
The same operations were performed on the DSB dataset. First, the system was trained for all three models running 100 and 125 epochs, and the results are presented in Fig. 10. Subsequently, 150 and 175 epochs were run, and the values obtained were shared in Fig. 11. In the DSB dataset, the lowest success results in the first 25 epochs were obtained from the UNet++ model in the training phase. At this stage, as in the SCIS dataset, the basic parameters such as the optimization function, batch size, and loss function used in all three models were used at the same values.
Similar operations were performed on the BCS dataset. First, the system was trained for all three models running 100 and 125 epochs, and the results are presented in Fig. 12. Then, the values obtained by running 150 and 175 epochs are shared in Fig. 13. Basic parameters such as optimization function, batch size, and loss function used in all three models were used at the same values.
In addition, as can be seen from studies in the literature, the depth levels of the models were kept the same in order to make an accurate performance comparison [51, 61]. For this reason, the depth level of all models used in the study was set to four. Thus, the results of the three models were evaluated under similar conditions. All the values obtained for the three datasets are given in detail in Table 4. When the training results are examined, the results of the CellSegUNet model are considerably better than those of the ResUNet and UNet++ models, and the training results of the ResUNet model are better than those of the UNet++ model.
Table 4
Training results for the SCIS, DSB, and BCS datasets (each cell: Accuracy / Dice / Jaccard)

Model       | Dataset | 100 epochs            | 125 epochs            | 150 epochs            | 175 epochs
ResUNet     | SCIS    | 0.832 / 0.801 / 0.601 | 0.835 / 0.812 / 0.603 | 0.838 / 0.816 / 0.612 | 0.841 / 0.822 / 0.613
ResUNet     | DSB     | 0.726 / 0.709 / 0.681 | 0.742 / 0.729 / 0.707 | 0.750 / 0.739 / 0.722 | 0.794 / 0.785 / 0.772
ResUNet     | BCS     | 0.837 / 0.791 / 0.783 | 0.869 / 0.827 / 0.818 | 0.878 / 0.837 / 0.827 | 0.894 / 0.854 / 0.844
UNet++      | SCIS    | 0.747 / 0.575 / 0.372 | 0.744 / 0.578 / 0.374 | 0.747 / 0.580 / 0.376 | 0.762 / 0.607 / 0.405
UNet++      | DSB     | 0.582 / 0.403 / 0.254 | 0.620 / 0.471 / 0.333 | 0.622 / 0.485 / 0.352 | 0.654 / 0.525 / 0.398
UNet++      | BCS     | 0.811 / 0.714 / 0.744 | 0.832 / 0.740 / 0.770 | 0.849 / 0.758 / 0.788 | 0.864 / 0.774 / 0.804
CellSegUNet | SCIS    | 0.917 / 0.922 / 0.802 | 0.927 / 0.924 / 0.812 | 0.933 / 0.941 / 0.819 | 0.934 / 0.959 / 0.833
CellSegUNet | DSB     | 0.925 / 0.884 / 0.867 | 0.936 / 0.908 / 0.907 | 0.932 / 0.896 / 0.885 | 0.985 / 0.975 / 0.968
CellSegUNet | BCS     | 0.925 / 0.899 / 0.885 | 0.957 / 0.934 / 0.917 | 0.963 / 0.942 / 0.923 | 0.982 / 0.972 / 0.942
After the training stage, all three models were tested for three datasets. During the test stage, the accuracy, dice, and jaccard values were calculated for each of the test images. Afterward, the values in Table 5 were obtained by applying Eqs. (33)–(35).
Table 5
Test results for SCIS, DSB, and BCS

Model       | Dataset | Accuracy | Dice  | Jaccard
ResUNet     | SCIS    | 0.831    | 0.802 | 0.592
ResUNet     | DSB     | 0.704    | 0.746 | 0.748
ResUNet     | BCS     | 0.851    | 0.823 | 0.791
UNet++      | SCIS    | 0.751    | 0.585 | 0.381
UNet++      | DSB     | 0.632    | 0.513 | 0.378
UNet++      | BCS     | 0.813    | 0.706 | 0.723
CellSegUNet | SCIS    | 0.931    | 0.957 | 0.829
CellSegUNet | DSB     | 0.980    | 0.970 | 0.959
CellSegUNet | BCS     | 0.976    | 0.971 | 0.927
As can be seen in Table 5, when the accuracy values obtained from the SCIS dataset are examined, it is seen that the worst values are obtained from the UNet++ model. The values obtained from the ResUNet model gave better performance than the values obtained from the UNet++ model. In addition, the best performance results were obtained from the proposed CellSegUNet model. When the dice and jaccard values for the same dataset are examined, it is seen that the same order is preserved. However, when the dice and jaccard values obtained were analyzed separately, it was determined that the dice values were higher than the jaccard values.
This is an expected outcome, since all measurement results in the study are evaluated at the pixel level and the dice coefficient is by definition greater than or equal to the jaccard index for the same segmentation. When all the test results on the SCIS dataset are evaluated collectively, the proposed CellSegUNet model has significantly higher performance than the other two models. Likewise, similar results were obtained on the BCS dataset.
When the accuracy values obtained from the DSB dataset are examined, the worst values again come from the UNet++ model, as in the SCIS dataset, and CellSegUNet gives better results than the ResUNet model. In addition, the original masks of the test images in the DSB dataset and the masks obtained from the models used in the study are presented in Table 6. As shown in Table 6, the masks obtained from the UNet++ model are quite weak compared to the other two models, whereas the masks obtained from the proposed CellSegUNet model are notably good.
Table 6
Some of the test results for the DSB dataset
https://static-content.springer.com/image/art%3A10.1007%2Fs00521-023-09374-3/MediaObjects/521_2023_9374_Tab6_HTML.png
Similarly, the sample images and masks in the SCIS dataset and the masks obtained from the models used in the study are shared in Table 7. When the images in Table 7 are examined visually, the UNet++ masks do not appear noticeably poor, and specialist health personnel were of the same opinion. However, when the same images were analyzed with the success metrics, the results differed. The differences in the images become noticeable only when they are magnified sufficiently. This demonstrates how sensitive the measurements of computer-aided systems are.
Table 7
Some of the test results for the SCIS
https://static-content.springer.com/image/art%3A10.1007%2Fs00521-023-09374-3/MediaObjects/521_2023_9374_Tab7_HTML.png
Tables 6 and 7 show that the proposed CellSegUNet model performs best on all three datasets. Although the success of the CellSegUNet model is high, the reason for the comparatively poor performance of the UNet++ model was also investigated. The investigation showed that the failure of the UNet++ model stems from the connection nodes between the same-level encoder and decoder in the UNet++ model.
In the UNet++ model, a high node count was observed to affect success negatively. The proposed model uses no such nodes; an attention block was added instead, so there is only one structure at the same level between the encoder and the decoder. Adding further nodes or structures to the connection between the encoder and the decoder both increases the processing time and negatively affects the success of the system. Future research can investigate how to keep the structures used in this interconnection at an optimum level. While examining both tables, some colored images in the DSB dataset draw attention, as mentioned in the fourth section of the study; these colored images were used after conversion to gray format.
The success of the proposed CellSegUNet model was also compared with existing studies that use the same datasets. Studies in the literature using the SCIS, DSB, and BCS datasets are presented in Tables 8, 9, and 10, respectively, together with the methods used and the reported results.
Table 8
Comparison of the proposed model with other studies using the SCIS dataset in the literature

References and dataset [50]   Methods       Accuracy   Dice    Jaccard
Cai et al. [5]                VSM           –          0.58    –
Gehui Li [62]                 PAPF-PAN      –          0.35    –
Our proposed model            CellSegUNet   0.931      0.957   0.829
Table 9
Comparison of the proposed model with other studies using the DSB dataset in the literature

References and dataset [58]   Methods                                    Accuracy   Dice    Jaccard
Alom et al. [63]              R2U-Net                                    –          0.92    –
Hollandi et al. [64]          Mask R-CNN + UNet                          –          0.63    –
Xu et al. [65]                DCSAU-Net                                  –          0.95    0.85
Konopczyński et al. [66]      Mask R-CNN + UNet                          –          0.92    0.72
Tran et al. [67]              Trans2UNet                                 –          0.92    –
Jha et al. [68]               DoubleUNet with VGG19                      –          0.91    –
Deb et al. [69]               UNet                                       –          0.92    0.85
                              BCDUNet                                    –          0.92    0.85
                              DoubleUNet                                 –          0.92    0.86
                              DoubleUNet with DenseNet                   –          0.92    0.86
                              DoubleUNet with DenseNet and Xception      –          0.82    0.70
                              DoubleUNet with DenseNet and NasNet        –          0.65    0.48
                              DoubleUNet with DenseNet and InceptionV2   –          0.89    0.80
Lai et al. [70]               AxialAtt-MLP-Mixer                         –          0.92    0.85
Singha and Bhowmik [1]        AlexSegNet                                 –          0.91    –
Our proposed model            CellSegUNet                                0.980      0.970   0.959
Table 10
Comparison of the proposed model with other studies using the BCS dataset in the literature

References and dataset [59]   Methods             Accuracy   Dice    Jaccard
Depto et al. [71]             Otsu's method       –          0.926   0.865
                              BHT                 –          0.525   0.494
                              Watershed           –          0.782   0.682
                              U-Net               –          0.930   0.871
                              U-Net++             –          0.888   0.814
                              TernausNet          –          0.933   0.876
                              R2U-Net             –          0.867   0.777
                              Attention U-Net     –          0.910   0.837
                              Attention R2U-Net   –          0.785   0.652
                              FCN                 –          0.854   0.752
Toptas and Hanbay [72]        DeepLabv3+          –          0.960   0.910
Our proposed model            CellSegUNet         0.976      0.971   0.927
At the end of the study, the training times of all models on the three datasets are presented in Table 11. Although the ResUNet model has no node structure, its training time turned out to be longer than that of the UNet++ model. Across the three models, the shortest training time belongs to UNet++, which excels only in terms of time; the training time of the proposed model is close to that of UNet++. In the proposed CellSegUNet model, the attention block plays a role similar to the node structure in UNet++, but its internal structure is completely different and only one attention block is used at each layer.
Table 11
Training times (s) on the SCIS, DSB, and BCS datasets

Epochs   CellSegUNet   ResUNet   UNet++    Dataset
175      21,205        23,255    20,150    SCIS
         25,620        28,750    24,800    DSB
         83,570        92,720    80,300    BCS
150      18,176        19,933    17,271    SCIS
         21,960        24,643    21,257    DSB
         71,350        79,752    68,376    BCS
125      15,146        16,611    14,393    SCIS
         18,300        20,536    17,714    DSB
         60,523        63,472    55,728    BCS
100      12,117        13,289    11,514    SCIS
         14,640        16,429    14,171    DSB
         47,923        51,237    44,326    BCS
As a result, on the SCIS dataset at 175 epochs, the ResUNet training time is 9.66% longer than that of the proposed CellSegUNet model, while CellSegUNet takes 5.23% longer than UNet++. On the DSB dataset at 175 epochs, ResUNet takes 12.21% longer than CellSegUNet, while CellSegUNet takes 3.30% longer than UNet++. During training it was observed that, as the size of the input images decreased, the time difference between the CellSegUNet and UNet++ models shrank while the difference between CellSegUNet and ResUNet grew. This indicates that reducing image size benefits the proposed attention block in particular.
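The quoted percentages can be reproduced from Table 11. The minimal sketch below computes each difference relative to the shorter model's training time, which is the convention that matches the values in the text up to rounding.

```python
# Training times at 175 epochs (seconds), taken from Table 11.
times = {
    "SCIS": {"CellSegUNet": 21205, "ResUNet": 23255, "UNet++": 20150},
    "DSB":  {"CellSegUNet": 25620, "ResUNet": 28750, "UNet++": 24800},
}

for dataset, t in times.items():
    # ResUNet relative to CellSegUNet, and CellSegUNet relative to UNet++
    vs_resunet = 100 * (t["ResUNet"] - t["CellSegUNet"]) / t["CellSegUNet"]
    vs_unetpp = 100 * (t["CellSegUNet"] - t["UNet++"]) / t["UNet++"]
    print(f"{dataset}: ResUNet +{vs_resunet:.2f}%, CellSegUNet +{vs_unetpp:.2f}%")
# SCIS: ResUNet +9.67%, CellSegUNet +5.24%
# DSB:  ResUNet +12.22%, CellSegUNet +3.31%
```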
When the studies in the literature are evaluated in terms of speed, only a limited number report timings. The first is the study by Cai et al. [5]: training their model on the SCIS dataset took approximately 3 h for the preliminary model and 18 h for the final model. Training CellSegUNet on the SCIS dataset took approximately 5.9 h, longer than Cai et al.'s preliminary model but much shorter than their final model. Xu et al. [65] developed a method on the DSB dataset; although they stated that it has a shorter inference time than UNet, they did not report numerical values.
As these examples show, clear information about inference times is missing from many studies. Moreover, differences in basic parameters such as input image size, number of epochs, learning rate, batch size, and hardware inevitably produce differences in inference times, so such comparisons cannot be made with full accuracy and reliability.
Apart from these, among studies on different datasets, the study by Li et al. reports the inference times of their model in full [41]. Although the parameters of that study do not match those of the CellSegUNet model, they give an idea of typical model inference times. Based on these studies, the training time of the proposed model is at a reasonable level compared to the literature examined.
In addition, if the system is evaluated in terms of complexity, the complexity of the CellSegUNet model is closely tied to its depth level, as in the classical UNet and UNet++ models. There is also an important connection between complexity and the sizes of the filters in the convolution layers of the layered CellSegUNet architecture, together with their output dimensions. The output dimensions of the residual blocks used in the proposed model are given in detail in Table 1; these directly affect the complexity of the system.
The results of all models in the study on the training and test sets are presented in detail in Tables 4 and 5. Existing uses of UNet++ and residual blocks in the literature also played a decisive role in the development of the proposed model.
In the node structure used between the encoder and decoder of the UNet++ model, the value obtained from the first node at a given level is transferred to the last node through very dense skip connections. During this transfer, the value coming from one node may increase the success while the value coming from another decreases it. In the residual blocks used in this study, by contrast, the values are reinforced with skip connections inside the block, and because max-pooling is applied before up-sampling, consistently large activation values are propagated. Although the transition between the encoder and decoder is simpler than in UNet++, the resulting values are transferred to the decoder and sent directly to the output layer so that their distance from the real image can be checked.
In the qualitative analysis of the study, the residual connections within the blocks of the encoder and decoder of the proposed model reinforce the features and prevent them from being lost. This block structure, presented in Fig. 6a, is not used in models such as UNet, UNet++, and UNet3+. Thanks to it, the obtained values are transferred to the next lower layer through both the attention module and down-sampling in the CellSegUNet model; similarly, in the decoder, values are passed to the next layer through both the multiply module and up-sampling. The number of blocks could be increased further, but this would significantly raise the complexity of the model and its training time, so the number of blocks in CellSegUNet was kept at an optimum level. A rough sketch of this block structure is given below. The output layer used in the study is a new structure developed specifically for this model and is not used in UNet-based approaches in the literature.
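The sketch below illustrates a residual block with a serial convolutional path and a parallel skip path whose outputs are summed, in the spirit of the blocks described here. The channel counts, kernel sizes, and normalization choices are illustrative assumptions and do not reproduce the exact configuration reported in Table 1.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Sketch of a residual block with serial and parallel conv paths."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # Serial path: two stacked 3x3 convolutions (assumed sizes)
        self.serial = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # Parallel path: 1x1 convolution acting as the intra-block
        # skip connection and matching the channel count
        self.parallel = nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Summing both paths reinforces features and prevents data loss
        return self.act(self.serial(x) + self.parallel(x))
```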
The basic logic of this structure is that, when more than one feature is used, one feature may increase performance while another decreases it. This is undesirable, and the output layer used in the study was designed to prevent it: features from each level are sent to the output layer of the CellSegUNet model, the difference between each level's output and the input is measured, and each difference is evaluated separately. As a result, the success of the system increases. A sketch of this selection logic follows.
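The following is a minimal sketch of the described selection: each depth level produces a candidate map, its difference from the input-derived reference is measured, and the candidate with the smallest difference becomes the system output. The mean absolute difference used here is an assumption; the text specifies only that the smallest difference wins.

```python
import torch

def winner_takes_all_output(level_outputs: list[torch.Tensor],
                            reference: torch.Tensor) -> torch.Tensor:
    """Select the per-level output closest to the reference map."""
    # One scalar difference per depth level (assumed: mean absolute diff)
    diffs = [torch.mean(torch.abs(out - reference)) for out in level_outputs]
    # The level with the lowest difference constitutes the system output
    best = min(range(len(level_outputs)), key=lambda i: diffs[i])
    return level_outputs[best]
```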
The best results of the CellSegUNet model were 0.98 accuracy and 0.959 jaccard on the DSB dataset. CellSegUNet also produced better training and testing results on the other datasets than the UNet++ and ResUNet models, as shown in Tables 6, 7, and 8. Although CellSegUNet and ResUNet give broadly similar results, CellSegUNet performs better, especially on fine details. The UNet++ model, which serves as the main backbone, lags further behind: on the DSB dataset it gave accuracy, dice, and jaccard values of 0.632, 0.513, and 0.378, respectively.

6 Conclusions

UNet-based segmentation models used in recent years form the basis of very important structures that can help reveal the mysteries of cell structures. In these approaches, segmentation success can be insufficient for some images due to difficulties such as irregular shapes, varying cell densities, negative external factors, and overlapping cell nuclei.
In this study, a new UNet++ and residual-block-based model, CellSegUNet, which exploits the features passing between the encoder and decoder, is proposed to realize an efficient and lightweight segmentation network architecture. In the proposed model, the blocks in the encoder structure were redesigned, unlike in the classical UNet++ and ResUNet.
The designed system was tested on the DSB, SCIS, and BCS datasets, and the proposed model was compared with the UNet++ and ResUNet models from the literature. Accuracy, dice, and jaccard were used as performance metrics. In the experiments, the proposed model achieved accuracy, dice, and jaccard values of 0.931, 0.957, and 0.829 on the SCIS dataset; 0.980, 0.970, and 0.959 on the DSB dataset; and 0.976, 0.971, and 0.927 on the BCS dataset, respectively.
On all three datasets, the proposed model gave better results than the other two models; all values are presented in detail in Table 5 in Sect. 5. The experimental results were also compared with other studies using the same datasets, and the proposed model showed a higher success rate. Detailed comparisons with the literature are presented in Tables 8, 9, and 10 in Sect. 5.
Two factors underlie the proposed model's shorter training time compared to the ResUNet model and its higher success rate than all tested models. The first is the attention block structure proposed in the study. The key contribution of this block is that it can simultaneously evaluate three different states of the data: before, at, and after its current level. This follows the logic of recurrent neural network structures; a minimal sketch of the idea is given below.
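The sketch assumes the three states are feature maps from the previous, current, and next depth levels, resized to a common resolution, concatenated, and turned into a sigmoid attention map. The actual block in the paper may fuse these states differently.

```python
import torch
import torch.nn as nn

class LevelAttention(nn.Module):
    """Sketch: jointly weigh features from three adjacent depth levels."""

    def __init__(self, ch_prev: int, ch_cur: int, ch_next: int):
        super().__init__()
        # 1x1 convolution fuses the three states into the current width
        self.fuse = nn.Conv2d(ch_prev + ch_cur + ch_next, ch_cur, 1)
        self.gate = nn.Sigmoid()

    def forward(self, f_prev, f_cur, f_next):
        size = f_cur.shape[-2:]
        # Bring the neighboring levels to the current spatial resolution
        f_prev = nn.functional.interpolate(f_prev, size=size,
                                           mode="bilinear", align_corners=False)
        f_next = nn.functional.interpolate(f_next, size=size,
                                           mode="bilinear", align_corners=False)
        # Joint evaluation of the three states yields an attention map
        attn = self.gate(self.fuse(torch.cat([f_prev, f_cur, f_next], dim=1)))
        return f_cur * attn
```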
The second factor is the output layer added to the end of the designed model. Thanks to this layer, the values obtained from each depth level of the model can be evaluated separately. The underlying idea is the "winner takes all" logic used in neural network architectures: the value that gives the best result among those evaluated in the output layer becomes the output of the whole system. Unnecessary outputs are thus excluded from further calculations, which contributes to reducing training time. In addition, all three models, built at the same depth level, were compared in terms of training time. The training period of the proposed CellSegUNet model is considerably shorter than that of ResUNet, although slightly longer than that of UNet++; given CellSegUNet's performance, this time difference is negligible. Notably, while ResUNet is more successful than UNet++, its training also takes longer, so if some success can be traded for fast training, UNet++ remains a usable method. The proposed model, however, is more successful than both.
The experiments showed that the connection between the encoder and decoder at the same level directly affects both segmentation success and processing time. It is therefore anticipated that future work will explore node-like attention blocks with different structures to optimize this encoder-decoder connection. In addition, since the proposed model has a layered architecture, it can be adapted to different applications by changing the depth level; beyond biomedical imaging, it is targeted at segmentation datasets from other sectors.

Declarations

Conflict of interest

The author declares that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Literature
2. Narotamo H, Sanches JM, Silveira M (2019) Segmentation of cell nuclei in fluorescence microscopy images using deep learning. In: Iberian conference on pattern recognition and image analysis, pp 53–64
5. Cai X, Cai H, Xu K, Tu W-W, Li W-J (2022) VSM: a versatile semi-supervised model for multi-modal cell instance segmentation
7. Tajbakhsh N, Jeyaseelan L, Li Q, Chiang JN, Wu Z, Ding X (2020) Embracing imperfect datasets: a review of deep learning solutions for medical image segmentation. Med Image Anal 63:101693
10. Zhang Y, Yang Q (2021) A survey on multi-task learning. IEEE Trans Knowl Data Eng 34(12):5586–5609
19. Çetiner H (2022) Citrus disease detection and classification using based on convolution deep neural network. Microprocess Microsyst 104687
21. Çetiner H, Çetiner İ (2022) Classification of cataract disease with a DenseNet201 based deep learning model. J Inst Sci Technol 12(3):1264–1276
23. Ben-Cohen A, Diamant I, Klang E, Amitai M, Greenspan H (2016) Fully convolutional network for liver segmentation and lesions detection. In: Deep learning and data labeling for medical applications, pp 77–85
24. Natesan P, Keerthika S, Gothai E, Thamilselvan R (2021) Generative adversarial network with masking bits based image augmentation technique for nuclei image classification. In: 2021 5th international conference on computing methodologies and communication (ICCMC), pp 1700–1705. https://doi.org/10.1109/ICCMC51019.2021.9418416
26. Ayalew YA, Fante KA, Mohammed MA (2021) Modified U-Net for liver cancer segmentation from computed tomography images with a new class balancing method. BMC Biomed Eng 3(1):1–13
27. Li D et al (2021) Robust blood cell image segmentation method based on neural ordinary differential equations. Comput Math Methods Med 2021
28. Kumar SN et al (2021) Lung nodule segmentation using UNet. In: 2021 7th international conference on advanced computing and communication systems (ICACCS), vol 1, pp 420–424
30. Minaee S, Boykov YY, Porikli F, Plaza AJ, Kehtarnavaz N, Terzopoulos D (2021) Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell 44(7):3523–3542
32. Shan T, Yan J (2021) SCA-Net: a spatial and channel attention network for medical image segmentation. IEEE Access 9:160926–160937
33. Guo C, Szemenyei M, Yi Y, Wang W, Chen B, Fan C (2021) SA-UNet: spatial attention U-Net for retinal vessel segmentation. In: 2020 25th international conference on pattern recognition (ICPR), pp 1236–1242
35. Zhao C, Han J, Jia Y, Gou F (2018) Lung nodule detection via 3D U-Net and contextual convolutional neural network. In: 2018 international conference on networking and network applications (NaNA), pp 356–361
36. Zhang R, Huang L, Xia W, Zhang B, Qiu B, Gao X (2018) Multiple supervised residual network for osteosarcoma segmentation in CT images. Comput Med Imaging Graph 63:1–8
47. Das S, Deka A, Iwahori Y, Bhuyan MK, Iwamoto T, Ueda J (2019) Contour-aware residual W-Net for nuclei segmentation. Proc Comput Sci 159:1479–1488
48. Baldeon-Calisto M, Lai-Yuen SK (2020) AdaResU-Net: multiobjective adaptive convolutional neural network for medical image segmentation. Neurocomputing 392:325–340
49. Ibtehaz N, Rahman MS (2020) MultiResUNet: rethinking the U-Net architecture for multimodal biomedical image segmentation. Neural Netw 121:74–87
50. Kaggle (2022) Sartorius cell instance segmentation
52. Cao Y, Xu J, Lin S, Wei F, Hu H (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of the IEEE/CVF international conference on computer vision workshops
65. Xu Q, Duan W, He N (2022) DCSAU-Net: a deeper and more compact split-attention U-Net for medical image segmentation. arXiv preprint arXiv:2202.00972
66. Konopczyński T et al (2022) Instance segmentation of densely packed cells using a hybrid model of U-Net and Mask R-CNN. In: Artificial intelligence and soft computing, pp 626–635
72. Toptaş M, Hanbay D (2023) Segmentation of microscopic blood cell images with current deep learning architectures. J Eng Sci Res 5(1):135–141