Introduction

Globally, as a major public health problem, low back pain has been the leading cause of disability worldwide for the past 30 years, with a burden on individuals, healthcare, and society1. IVD herniation, spinal stenosis, or ossification of the facets may pinch nerves, which may contribute to worsening LBP disease states. As an early pathological phenotype of LBP, IVD degeneration is necessary but hardly being quantified. IVD comprises a gel-like nucleus pulposus (NP), collagenous annulus-fibrosis (AF) layers, and ring-like cartilaginous endplates (EP), playing an important role in mechanical transmitting loads from body weight and daily activity through the spine column2.

The extracellular matrix such as collagen and aggrecan provide tensile strength and osmotic-pressure regulation3,4, which may degrade with aging. Accumulated compressive loads may accelerate this progressive process in IVD degeneration5,6,7, which may lead to LBP with increased inflammation8, nerve compression9, and release of pain factors4. For IVD degeneration, T2-weighted (T2W) MRI is excellent at detecting the morphologic changes, including height loss and water-intensity loss. Pfirrmann et al. developed a grading system for IVD degeneration according to signal intensity and geometric features, which is one of the most accepted around the world10. However, it is highly dependent on the level of reader expertise, which showed only moderate interobserver agreement between readers in most studies5, and the qualitative grading method cannot accurately reflect progressive changes in IVD-degeneration process11. Some measured IVD quantitative parameters were reported in early researches12,13,14,15,16,17, which showed a better capacity to reflect the aging effect of IVD degeneration with relatively lower measurement error, thereby improving the quality of research on intervertebral disc degeneration18. However, the consistency and efficiency are not good enough to widely use, due to the need in manually segmenting the relevant area of IVDs and marking the corresponding feature points, as well as the inevitable subjective error and the limitation of the grayscale discrimination ability of human eyes. To some extent, researches on etiology and pathogenesis of IVD degeneration are progressing slowly with the limitation of measurement methods16. In addition, many studies reported the relationship between IVD height and demographic factors such as age and gender19,20,21,22,23. Although Shao et al.17 established a linear model of IVD height-related parameters, the relationships between IVD height and other factors were not unified yet.

With the development of machine learning and deep learning, many studies regarded IVD-degeneration grading as a classification task. From the “shallow learning” task of manually making the degenerative features of the IVD24,25,26,27, to the “deep learning” task of using the entire lumbar IVD bounding box to let the convolutional network learn the degenerative features by itself28, the accuracy of these classification methods is comparable to that of radiology experts. However, these methods still require some manual input or complex detection algorithms, and hardly reflect progressive IVD-degeneration process. Meanwhile, many studies used manual segmentation methods to achieve quantitative analysis of the signal intensity and height characteristics of IVDs13,14,16. There has also been some studies on the quantitative measurement of intervertebral discs based on deep learning, but they did not use quantitative data to evaluate IVD degeneration, which may cause subjective bias and limit its clinical application18,29. Although the U-Net semantic segmentation model for the first time to achieve automatic segmentation and feature extraction of IVD-related regions18, they did not reveal the IVD-degeneration process from their extracted geometric parameters.

Here, we developed an improved deeplabv3+ segmentation network with newly designed modules and a quantitative method for IVD degeneration. After evaluating model performance in segmentation accuracy, quantitation consistency, and application compatibility, a baseline characteristic of IVD signal intensity and geometric morphology among different gender and age and lumbar segments was established by extracting over 1000 MR images of a large patient population at different institutions in China, to develop a quantitative IVD-degeneration structured report. A diagram of this workflow is illustrated in Fig. 1.

Fig. 1: A flowchart of the study process from training and testing phase to data analysis phase with BianqueNet.
figure 1

Mid-sagittal T2W lumbar MR images exported into two different resolutions (512*512, 320*320) are used to train and test two models (Model A, Model B). After segmentation accuracy evaluation, IVD parameters are quantified based on IVD-related area. As Model A shows good performance on quantitation consistency for MR images with various resolutions, it is used to establish baseline characteristic with different IVD degeneration among 5255 IVDs. Workflow diagram at the bottom presents the segmentation network, IVD quantitation method, and IVD degeneration determination.

Result

Segmentation-performance improvement with modelus (DFE, ST-SC, and MFF)

Figure 2 and Table 1 depict segmentation performance of the BianqueNet model with or without modelus (DFE, ST-SC, and MFF). The segmentation performance of deeplabv3+ network without any modelus showed a moderate accuracy, whose mDice and mIoU were 94.45% and 89.88% for the whole lumbar spine, 96.71% and 93.66% for vertebral body, and 94.38% and 89.43% for IVD. The segmentation performance of BianqueNet showed a better accuracy, whose mDice and mIoU were 94.70% and 90.35% for the whole lumbar spine, 97.03% and 94.25% for vertebral body, and 94.80% and 90.19% for IVD, indicating that segmentation performance of deeplabv3+ combined with the three modules (DFE, ST-SC, and MFF) has been improved significantly. Even in the smaller sample Data Set B, BianqueNet also showed good segmentation performance.

Fig. 2: The segmentation performance of BianqueNet in three typical cases and the influence of different segmentation accuracy on feature-point detection and calculation.
figure 2

Segmentation results in case 1 and case 2 indicate that detailed information on the boundaries of vertebral bodies and IVDs is hard to detect. Segmentation in case 3 shows that irregular boundaries between the IVD and the vertebral bodies may interrupt segmentation with slight structural lesions or imaging defects. Feature-point-extracted results indicate that the precise segmentation may significantly improve the corner detection on vertebral bodies (red dots), thereby affecting the calculation results of the characteristic points on IVDs (green dots).

Table 1 Segmentation performance of BianqueNet compared with those without other modules (DFE, ST-SC, and MFF) in test sets.

Notably, our model’s segmentation contained more accurate and detailed structural information in IVDs and vertebral bodies in case 1 and case 2, while better discrimination in the boundary of IVD and vertebral body can be seen in case 3 (Fig. 2). The enhancement of the segmentation performance can significantly improve the accuracy of the corner detection for subsequent IVD-degeneration calculation, as shown in the feature-point part of Fig. 2.

Segmentation performance in clinical sites with different magnetic field strength

To test whether the model trained with MR images from Longhua Hospital (Data Set A) is applicable for other MR images from different hospitals, 60 MR images from other hospitals (Data Set C) and 80 MR images from Longhua Hospital (Data Set A) were randomly selected and segmented with the researcher and BianqueNet. Supplementary Table 1 depicts segmentation performance around these four hospitals. The segmentation performance for MR image from Dongzhimen Hospital was acceptably moderate, while those for other two hospitals showed no significant difference with the training set (Longhua Hospital).

Quantitation performance in different MR images with different resolutions

A total of 230 IVDs and 276 vertebral bodies of 46 subjects were segmented after resolution had been adjusted from 320*320 to 512*512. The results showed a good consistency in using different parameter-calculation algorithms for MR images with different resolutions. Among them, the normalized IVD geometric parameters (DHI and HDR) have extremely high ICC values, which are 0.958 (p = 0.000) and 0.956 (p = 0.000), respectively, while the normalized IVD signal intensity (ΔSI) showed high ICC value of 0.874 (p = 0.000), as shown in Supplementary Table 2. The reason why we set MR image resolution of 512*512 for final model input and model A for the final model is because a large proportion of images present resolution of 512*512 among all collected MR images. Considering that using interpolation method may miss or change information from the image by reducing or enlarging image sizes, we finally chose the middle-image resolution of 512*512 to retain the original information of MR images at the most extent.

Comparison with automatic quantitation and manual measurement

For comparison, the mean values of DH, DHI, and HDR were set as control to test model quantitation performance (Table 2). Thanks to carefully measuring all the IVDs by two senior residents, the intraobserver agreement between the two residents’ measurements on the 75 IVDs in 15 MR images presents ICC value of 0. 944 for DH (95% CI: 0.912, 0.964), 0.913 for DHI (95% CI: 0.862, 0.945), and 0.881 for HDR (95% CI: 0.730, 0.939), indicating a good interobserver agreement in all the IVD-related area measurements and index calculation between the two senior residents. Subsequently, mean of their measurements, as a control, was used to compare with the results extracted by the proposed network. There was moderate-to-good intraobserver agreement between machine and manual measurements with ICC value of 0.954 for DH (95% CI: 0.928, 0.971), 0.908 for DHI (95% CI: 0.856, 0.941), and 0.917 for HDR (95% CI: 0.810, 0.957).

Table 2 Consistency analysis of IVD parameters measurements between model and residents.

Compared with clinical radiologists and residents, the model provided highly repeatable and accurate IVD geometric measurements. In particular, the consistency in VB area-measured results between model and residents was the highest (ICC: 0.964), which may accord with segmentation validation in these areas (Fig. 2). The consistency in IVD area-measured results between model and residents was relatively lower because area for average IVD height calculation was selected with residents’ own discrimination (Supplementary Fig. 1), while it was detected with featured points by model calculation (Fig. 6h). For calculated parameters (DH, DHI, and HDR), both measurement error and selection subjectivity may affect consistency between model and residents, but our evaluation result still presents acceptable good performance (ICC > 0.9).

Model performance in patient subgroups by gender, age, and segments

After screening 1508 MRI images in 4 sites around China, a total of 1051 individuals were collected, in which there are 73 excluded for unaligned outlines (diagnosed as lumbar spondylolisthesis), 45 excluded for abnormal signal intensity distribution (diagnosed as spine tumors), 364 excluded for irregular structures (diagnosed as IVD herniation or vertebral body ossification), and 144 excluded for imaging quality (segmentation results and corner detection did not meet the requirements of parameter calculation). The demographic information (age and gender) distributed evenly as shown in Supplementary Table 3, which was integrated to conduct correlation analysis with IVD parameters.

Supplementary Fig. 2 and Table 3 show comprehensive baseline characteristics of IVD parameters in a larger population. ΔSI in IVDs decreased with age, while DH of IVDs increased with age, reaching peak at the age of 50–60 (P < 0.01). There is no significant difference between male and female in ΔSI in IVDs, while DH, DHI, and HDR of IVDs were significantly higher in males than those in females (P < 0.01). In addition, DH, DHI, and HDR were significantly higher in lower segmental IVDs (L3–L4, L4–L5 and L5–S1) than upper ones (L1–L2 and L2–L3), and disc height of L4–L5 IVDs was the highest (P < 0.01). Through multivariate linear-regression analysis (MLR), we investigated the distribution of IVD geometric parameters and signal intensity of each segment in a large population with different age and gender (shown in Table 5). Variables such as gender and segments have significant correlations with ΔSI, while variables such as age, gender, and segments have significant correlations with geometric parameters.

Table 3 The results of multiple regression analysis of \(\triangle {{{{{{\rm{SI}}}}}}}\), DH, DHI, HDR and gender, different ages, and different segments.

Validity in IVD-degeneration grading performance

Considering height decrease and water-content loss with IVD degeneration, a regression analysis was conducted to investigate the correlation between IVD parameters and degeneration grading (ΔSI with corresponding grading (1, 2, 3, 4, and (5–8)), geometric parameters (DH, DHI, and HDR) with corresponding grading ((1–5), 6, 7 and 8)) in each segment. As shown in Table 4, IVD parameters showed a good accordance to the modified Pfirrmann grade.

Table 4 Correlations between IVD parameters and modified Pfirrmann grading.

The result from a stronger correlation between the modified Pfirrmann grade (1, 2, 3, 4, and (5–8)) and ΔSI (ρ = −0.966, P = 0.000), which demonstrates that the water content of NP is decreasing with the whole IVD-degeneration process. Therefore, specific ranges of ΔSI according to the modified Pfirrmann grade (1, 2, 3, 4, and (5–8)) were calculated and set as automatic grading criteria, which are shown in Fig. 3a and Supplementary Table 4.

Fig. 3: Baseline characteristics of IVD parameters in geometric and signal intensity.
figure 3

The mean and standard deviation (σ) of the ΔSI of each of the modified Pfirrmann grading system (levels 1, 2, 3, 4, and 5) were calculated from Dataset C, which is used to quantify IVD degeneration (a), ΔSI (b), DH (c), DHI (d), and HDR (e) were quantified in different age, gender, and segments to establish population baseline.

According to these statistical results, ranges of IVD geometric parameters and levels of IVD signal intensity for each segment in these participates of different ages and genders were established as Chinese population baseline, which is shown in Fig. 3 and Supplementary Tables 58.

Although our method is focused on quantitative measurement other than degeneration-grade classification, it still presents strong accordance with manual IVD-degeneration grading (macroF1: 92.02% and 90.63% in two data sets) by means of quantifying IVD degeneration, which is shown in Table 5.

Table 5 Accuracy of IVD degeneration grading with ΔSI in IVD.

Discussion

In this work, we propose an automatic IVD-degeneration quantitative method based on deep-learning segmentation, in which a powerful semantic segmentation network (BianqueNet) was designed to achieve accurate segmentation with IVD-related areas from T2W MR images. In the quantitation section, an improved histogram method was proposed, and automatic calculation methods were modified to qualify signal intensity and geometric information of vertebral bodies and IVD. To investigate baseline characteristics of IVD, this method was used in a large population to collect IVD geometric parameters (structural collapse) and signal intensity (water content) with different degeneration grade, age, and gender. A IVD-degeneration quantitative criteria in different population subgroups were finally established by correlation analysis and multiple-regression analysis. Finally, the deviation method was used to achieve the degeneration grading and quantitative analysis on IVDs.

The deep-learning approach allows the network to perform lumbar IVD segmentation and parameter quantitation simultaneously, which may help doctors and patients obtain more IVD-degeneration status information from traditional T2W MR images. Considering large time-consuming and high internal/external differences of IVD manual measurement shown in previous studies, this approach may provide a relatively efficient and accurate solution in a large population to extract more consistent IVD parameters and benefit several clinical applications (such as preventive screening, therapeutic evaluation, decision secondary, and mechanism investigation).

To provide a valuable diagnosis tool for IVD degeneration, quantitative-analysis methods may improve currently used qualitative classification methods15. Although many IVD quantitative methods were proposed and developed, the measured IVD parameters were not accepted and used with limited reliability and validity12,13,15,30. In this work, based on previous studies, some appropriate improvement was designed in signal intensity and geometric information, achieving automatic IVD parameter extraction by means of the latest deep-learning and image processing techniques. The precise segmentation of IVD-related areas is a key step to achieve the automatic extraction of IVD parameters. The proposed DFE and MFF modules integrate multi-scale-rich semantic information to improve the capacity of scene analysis, which may improve the ability to distinguish boundaries between vertebral bodies and IVDs. Meanwhile, the proposed ST-SC module can increase the focus on target information in different spatial domains, providing more accurate and fine-grained structural information for the upsampling path, to enhance the model obtaining more accurate and detailed contour information of IVDs and vertebral bodies. Notably, the trained model has been validated among additional data sets from other three hospitals, demonstrating that BianqueNet may perform consistency segmentation on MR images from different machines.

A normalized processing was added to improve general performance and application, by optimizing the IVD histogram analysis method. Benefiting from powerful segmentation performance of BianqueNet, our proposed IVD quantitative method may be able to precisely detect angle points of vertebral bodies and rapidly calculate feature points of IVDs, thus achieving an automatic IVD area-based quantitation in the first time. In addition, consistency evaluation indicated no significant difference between automated approach and senior radiologists/orthopedic residents’ measurements.

The normalized ΔSI showed excellent linear correlation with IVD degeneration (R = −0.966, P = 0.000), suggesting that IVD histogram analysis is a suitable tool for objective and continuous IVD-degeneration classification, which is similar to the conclusion of Waldenberg et al.13 Thus, we statistically analyzed IVD characteristic in ∆SI with different degeneration grades (corresponding to the modified Pfirrmann grade system), in which the results showed that the histogram features of IVD signal intensity had strong applicability in IVD-degeneration grading.

As IVD-structure collapse status is another important reference in IVD-degeneration grading, we carried out a correlation analysis between IVD geometric parameters (DH, DHI, and HDR) and IVD-structure collapse status. HDR presents the strongest linear correlation with IVD structural collapse, due to its containing both height and shape information that may be the main references in manual degeneration grading. In contrast, DHI showed limited correlation with manual grading, for the height of vertebral bodies used in DHI calculation is not likely considered in manual degeneration grading according to both Pfirrmann grade system and its modified version. Considering the variability of individual IVD height, it may affect the overall linear correlation between the geometric parameters and IVD-degeneration grades in structural collapse.

We further investigated the relationship between IVD quantitative parameters (DH, DHI, HDR, and ΔSI) and baseline demographic information (age, gender, and segment) with a multiple-regression analysis. IVD signal intensity (ΔSI) showed a stronger correlation with age and segment, indicating that accumulated loads may lead to water-content loss in IVD. On the other hand, height of IVD presents a phenomenon of increasing in young age and decreasing in older age, which accorded with previous studies23. However, our study revealed that there are no linear relationships between IVD geometric parameters (DH, DHI, and HDR) and age, which showed disagreement with previous studies19. For the DH parameter, the influence of gender is greater than that of the age. For the normalized parameters DHI and HDR, the influence of gender and age seems similar. The position of the structure shows the greatest influence on the geometric parameters. In fact, Pfirrmann grade system was designed based on symptomatic patients with an average age of about 40 years old19, whose reliability for early IVD degeneration or IVD degeneration in the elderly people may be unsatisfied. In our study, these correlation results played important roles in IVD quantitative degeneration-criterion establishment, achieving automatic quantitation for IVD degeneration in asymptomatic patients of different ages.

An important distinction of our study from previous works is the accurate IVD-parameter extraction from MR images of a large population to establish a criterion of IVD-degeneration quantitation, by means of a powerful segmentation network (BianqueNet) and improved area-based calculation.

Regarding future clinical practice and assessment, we will insert this network into MR-image system (Computer Software Intellectual Property Right, National Copyright Administration, P.R. China, No. 2021SR1211447) and export a structural lumbar intervertebral disc degeneration report like Supplementary Fig. 3 and Video 1 for doctors, patients, and researchers. Compared with traditional text-description MRI report, our quantitative report may provide more accurate IVD parameters to reflect height collapse and water-content loss with IVD degeneration. According to IVD baseline characteristic criteria in each age, gender, and segments, deviation of IVD geometric parameters and Pfirrmann grade based on signal intensity will be obtained automatically to reflect both structural collapse status and water-content loss in IVD comprehensively, which may provide more precise information for clinical practice (lumbar MR-image structural report), clinical trials (efficacy assessment), and mechanism investigation (biomechanics research and finite-element analysis). Notably, these baseline characteristics will be updated dynamically as these MR-image data are collected and summarized.

In our result, we found that the turning point for “peak IVD height” is in the 50–60 age range, which may be a secondary degenerative process. Due to the changes in vertebral osteoporosis, the endplate of the vertebral body becomes more depressed, which may make IVD sink into the vertebral body, resulting in lower vertebral height and higher disc height20,31.

Also, there are many studies on the relationship between age and height of IVD and vertebral bodies. H.S. Monoo-Kuofi et al.32 concluded that IVD height increases with age, but not in a linear fashion, with alternating periods of overgrowth and thinning, and a significant decrease of 2.5% after age 50. These studies support our results. In the future, we will continue to collect data from more MRI and possibly investigate these IVD parameters as aging.

Our study has some limitations. First, this retrospective study may be subject to potential selection bias. Some prospective studies should be rigorously conducted to test the clinical utility of this proposed model. Second, our deep learning model was trained and tested using Chinese patients, so its reproducibility among different ethnic people should be further evaluated. In future, it will be important to combine radiomics and prospective design and integrate all kinds of clinical examination, fluid-flow biomechanics, and molecular approaches to improve accuracy in IVD-degeneration evaluation.

In conclusion, we present a fully automated deep-learning-based lumbar-spine segmentation network and an area-based quantitative method to evaluate IVD degeneration according to the extracted parameters from a large population. Our approach can be used to improve IVD-degeneration evaluation with high accuracy and consistency.

Methods

Patients and datasets

This study was approved by the Institutional Review Board (IRB) in all the participating sites. Written informed consent was waived because of the retrospective nature of the data collection (age/gender) and the use of deidentified MR images.

Two separate segmentation models were trained and tested among mid-sagittal T2 lumbar-spine images of different resolution (512*512 in Data Set A and 320*320 in Data Set B). All the subjects’ lumbar-spine MR imaging was included in the Longhua Hospital, Shanghai University of TCM between January 1, 2019, and December 31, 2020, among which there are 223 subjects using a 1.5-T MRI unit (MAGNETOM Aera XJ, SIEMENS, Data Set A) and 63 subjects using another 1.5-T MRI unit (MAGNETOM Avanto, SIEMENS, Data Set B). These MR images were exported and randomly allocated into each training set or test set (Fig. 1). All images were labeled by LabelMe (version 3.3.6, CSAIL, Massachusetts Institute of Technology)33. Based on the structural features mentioned in the modified Pfirrmann grading system, the segmentation area of 14 parts included 5 vertebral bodies (L1–L5), 5 lumbar IVDs (L1/L2–L5/S1), sacrum (S1), presacral fat area, cerebrospinal fluid area (CSF) in the spinal canal, and background as Fig. 4b. Segmentation performance of IVD-related areas was tested by the mean Dice coefficients (mDice) and mean Intersection over Union (mIoU).

Fig. 4: The proposed BianqueNet consisted of three innovative modules.
figure 4

a Input MRI, (b) annotations, feature map before (c) and after (d) DFE module, and (e) each image-channel output by the model corresponds to a segmentation area. Feature map of different skip-connection path with (f) and without (g) ST-SC module.

In order to establish lumbar IVD baseline data in a large population, Data Set C composed of 1051 mid-sagittal T2 lumbar-spine images with different age (20–90) and gender was used to extract data by segmentation model and quantitative method among four hospitals around China, including Longhua Hospital, Shanghai University of TCM, Guangdong Provincial Hospital of Chinese Medicine, Shenzhen Pingle Orthopedics Hospital, and Dongzhimen Hospital, Beijing University of Chinese Medicine between January 1, 2019 and March 30, 2021. The imaging parameters of all sites are summarized in Supplementary Table 9.

Proposed network model

Overview of the BianqueNet architecture

As presented in Fig. 4, mid-sagittal T2W lumbar MR images are input into the backbone (a resnet101 network34 that uses the atrous separable convolution to improve the last stage) for 16 times the downsampling, from which richer semantic information and more dense features are extracted through the depth feature extraction module (see Depth feature extraction section for details). To restore more detailed features of segmentation targets, the original quadruple upsampling operation in Deeplabv3+35 was modified with double upsampling, while the general bilinear interpolation was replaced to transpose convolution for upsampling. At the same time, input the feature maps of different resolutions obtained by downsampling to the Swin Transformer-skip connection module (see “Swin Transformer–skip connection” section for details), and then fuse the upsampled feature maps of the same resolution to obtain feature maps of different scales. According to Feature Pyramid Network36, multi-scale feature fusion module (MFF) is used to combine the feature maps with strong low-resolution semantic information and feature maps with weak high-resolution semantic information but rich spatial information. Then a 3*3 double-convolutional layer is used for the fused feature map to improve the feature, and finally a double upsampling operation is performed to obtain a dense prediction image.

Swin transformer–skip connection module

Applications with transformer-based vision backbones such as Vision Transformer (ViT) achieved innovative technological breakthrough in recent years37,38, in which Swin Transformer, a layered transformer based on shifted windows, makes it compatible with a broad range of vision tasks. Compared with earlier sliding-window-based self-attention approaches39,40, Swin Transformer performs higher efficiency and lower complexity. In this study, a skip-connection module was designed with two successive Swin Transformer blocks, called as ST-SC module. As shown in Fig. 4, the Swin Transformer block is consisting of a shifted-window-based multi-head self-attention (MSA) module and a 2-layer multi-layer perceptron (MLP) with GELU nonlinearity. A layer norm (LN) layer is applied before each MSA and MLP module, while a residual connection is applied after each module37. At the same time, a 1*1 convolutional layer is applied after two successive Swin Transformer blocks, and finally the two output features are spliced to provide more accurate and fine-grained structural information for upsampling. To avoid affecting dense-feature output from upsampling, the number of output channels for the ST-SC module is adjusted to 1/8 times that of the downsampled feature maps. Calculation formulas are shown as the following:

$${\overline{{{{{{\rm{SC}}}}}}}}_{0}=W{\mbox{-}}{{{{{\rm{MSA}}}}}}\left({{{{{\rm{LN}}}}}}\left(X\right)\right)+X$$
(1)
$${{{{{\rm{SC}}}}}}_{0}={{{{{\rm{MLP}}}}}}\left({{{{{\rm{LN}}}}}}\left({\overline{{{{{{\rm{SC}}}}}}}}_{0}\right)\right)+{\overline{{{{{{\rm{SC}}}}}}}}_{0}$$
(2)
$${\overline{{{{{{\rm{SC}}}}}}}}_{1}={{{{{\rm{SW}}}}}}{\mbox{-}}{{{{{\rm{MSA}}}}}}\left({{{{{\rm{LN}}}}}}\left({{{{{{\rm{SC}}}}}}}_{0}\right)\right)+{{{{{{\rm{SC}}}}}}}_{0}$$
(3)
$${{{{{\rm{SC}}}}}}_{1}={{{{{\rm{MLP}}}}}}\left({{{{{\rm{LN}}}}}}\left({\overline{{{{{\rm{SC}}}}}}}_{1}\right)\right)+{\overline{{{{{\rm{SC}}}}}}}_{1}$$
(4)
$$Y={{{{\varnothing }}}_{f}}({\hat{X}},{{{{{\rm{SC}}}}}}_{1})$$
(5)

where \(X\) and \(\hat{X}\) denote the down-sampled feature maps of different resolutions and their output of 1*1 convolution, respectively; \(\overline{{{{{{\rm{SC}}}}}}}\) and \({{{{{\rm{SC}}}}}}\) denote the output features of the \((S)W{\mbox{-}}{{{{{\rm{MSA}}}}}}\) module and the MLP module, respectively; W-MSA and SW-MSA denote window-based multihead self-attention using regular and shifted-window partitioning configurations, respectively37. \({{{\varnothing }}}_{f}\) is the feature-fusion function, and Y is the output of ST-SC module.

Compared with segmentation by deeplabv3+ without ST-SC module, the contour information of the vertebral body in the feature map of skip-connection middle path is more accurate, and the contour information of intervertebral disc and cerebrospinal fluid in the feature map of skip-connection high path is more accurate (Fig. 4f, g). Therefore, the ST-SC module may provide more accurate detailed information for the upsampling path by increasing the focus on target information in different spatial domains.

Depth feature extraction module

In this study, the depth-feature extraction module (DFE) is designed between backbone output section and upsampling sections. Feature-map output from the backbone is extracted-feature information of different depth through pooling operation with different scale in the pyramid pooling module41. Combining the fused global feature map with the backbone–output-feature map, a multiscale contextual information feature map of 4096 channels was obtained to further extract a dense semantic feature map of 256 channels through the atrous spatial pyramid pooling (ASPP) module35.

Weighted dice-loss function

A weighted dice-loss function as below was proposed to enhance segmentation performance by estimating the difficulties in different images with typical or atypical structure, which ensured consistency in segmentation:

$${L}_{{{{{\rm{wdice}}}}}}=\frac{1}{C}\mathop{\sum }\limits_{j=1}^{C}{\xi }_{j}\left(1-\frac{2{\sum }_{i=1}^{N}{p}_{1i}{g}_{1i}}{2{\sum }_{i=1}^{N}{p}_{1i}{g}_{1i}+{\sum }_{i=1}^{N}{p}_{0i}{g}_{1i}+{\sum }_{i=1}^{N}{p}_{1i}{g}_{0i}}\right)$$
(6)

This formula was used in the output of the SoftMax layer, where the p1i is the probability of voxel i (target) and p1i is the probability of voxel i (nontarget). So was for \({g}_{1i}\) and \({g}_{0i}\). j represents different segmentation areas, C represents the total number of channels, which is taken as 14. \(\xi\) represent the weight of different segmentation channels. According to the experimental analysis results, channel weight was set to 0.9, 0.8, and 1 for vertebral body, IVD, and the other, respectively, which may achieve the best segmentation performance.

For avoiding that the subsequent feature-extraction operations are affected, corrosion and expansion operations were used to remove the burrs (Fig. 4e).

Lumbar IVD quantitative analysis

Parameter calculation based on IVD-related area segmentation

Based on previous studies18,25,26,27, signal-intensity difference (∆SI) in IVD areas was used to quantify the blurring degree of boundary between NP and AF, which indicates water-content loss with IVD degeneration. Average disc height (DH), disc-height index (DHI), and disc height-to-diameter ratio (HDR) were used to quantify structural collapse with IVD degeneration (Fig. 5). Detailed quantitative methods were described as below.

Fig. 5: Scheme diagram of IVD parameter calculation.
figure 5

Signal-intensity histogram calculation: (a) vertebral body area, (b) presacral fat area, (c) cerebrospinal fluid area, and (d) intervertebral disc area. e The outline of the segmented area is displayed on the original image. f Schematic diagram of ΔSI calculation. g Signal-intensity histogram corresponding to different Pfirrmann grade. h A geometric calculation method of lumbar disc-height parameters based on area, (i) vertebral body corner detection result (red points) and feature-point calculation result (green points), and (j) 80% area-extraction result of the intervertebral disc center.

Signal-intensity histogram features

The histogram feature is used to quantify different signal-intensity distribution in different areas from MRI, in which the X axis represents different signal intensities, and the Y axis represents the corresponding number of pixels. A two-peak distribution has been analyzed in healthy IVD from MRI, because the sharpness of the boundary between the NP and the AF can be well characterized with large amounts of pixel with two major signal intensities (Fig. 5d)13. With IVD degeneration, water-content loss in NP can be measured in histogram-feature distribution changes, which presents that previous higher signal intensity (light) in the IVD area gradually becomes lower (dark) (Fig. 5g). The difference in pixel numbers corresponding to different signal intensities can well describe the degeneration state. To reduce the influence of individual differences and MR-imaging condition, a modified method was used in calculating the difference between the two peaks in IVD signal intensity histogram after being normalized with the peak signal intensity of CSF in the spinal canal (Fig. 5f). The calculation formula of the signal-intensity difference (ΔSI) between two peaks is shown as the following:

$${\triangle {{{{{\rm{SI}}}}}}}^{i}=\frac{{{{{{\rm{SI}}}}}}_{2}^{i}-{{{{{\rm{SI}}}}}}_{1}^{i}}{{{{{{\rm{SI}}}}}}_{{{{{\rm{CSF}}}}}}}\times 255$$
(7)

Among them, \({{{{{\rm{SI}}}}}}_{1}^{i}\) and \({{{{{\rm{SI}}}}}}_{2}^{i}\) respectively represent the signal-intensity values corresponding to the 1st and 2nd peaks of the histogram of the IVD, i represents the position of the ith IVD. SICSF represents the signal intensity corresponding to the peak of the histogram of the CSF area, and 255 is an amplification factor.

Vertebral body height

According to the channels of the segmented vertebral body, the Shi–Tomasi corner detection method was used to accurately point the four corner vertices (superior–anterior (\({L}_{{{{{\rm{sa}}}}}}^{i}\)), superior–posterior (\({L}_{{{{{\rm{sp}}}}}}^{i}\)), inferior-anterior (\({L}_{{{{{\rm{ia}}}}}}^{i}\)), and inferior-posterior (\({L}_{{{{{\rm{ip}}}}}}^{i}\))) of the vertebral body (Fig. 5h). The Euclidean distance between two midpoints (\({L}_{{{{{\rm{ma}}}}}}^{i}\) of \({L}_{{{{{\rm{sa}}}}}}^{i}\) and \({L}_{{{{{\rm{ia}}}}}}^{i}\), \({L}_{{{{{\rm{mp}}}}}}^{i}\) of \({L}_{{{{{\rm{sp}}}}}}^{i}\) and \({L}_{{{{{\rm{ip}}}}}}^{i}\)) is defined as the vertebral body diameter (Fig. 5h). The vertebral body diameter (VD) calculation formula is shown as the following:

$${{{{{\rm{VD}}}}}}^{i}=\sqrt{{\sum}_{j=1}^{2}{\left({{L}_{{{{{\rm{ma}}}}}}^{i}}_{j}-{{L}_{{{{{\rm{mp}}}}}}^{i}}_{j}\right)}^{2}}$$
(8)

where i denotes the ith vertebral body, in the range from 1 to 5, j denotes the midpoint coordinate dimension, values of 1,2.

The area of the vertebral body was calculated with the sum of all the pixel values of the vertebral body mask channel, and then the vertebral body height was obtained by dividing by VD. The vertebral body height (VH) calculation formula is shown as the following:

$${{{{{\rm{VH}}}}}}^{i}=\frac{1}{{{{{{\rm{VD}}}}}}^{i}}\mathop{\sum }\limits_{x=1}^{h}\mathop{\sum }\limits_{y=1}^{w}{P}_{{{{{\rm{xy}}}}}}$$
(9)

Among them, h and w respectively represent the height and width of the picture, Pxy represents the pixel value when the height coordinate is x and the width coordinate is y, and the value of Pxy is 0 or 1.

Disc height

In the field of IVD height calculation, previous study showed that using area-based quantitative-measurement method was better than using point-based method, in which the result with excellent reliability showed that IVD height was equal to 60% or 80% of IVD diameter in sagittal view14. Therefore, in this study, the lumbar IVD height was calculated as 80% of lumbar-disc diameter.

After the feature-location points being obtained (Fig. 5i), the area of the lumbar IVD was calculated with the sum of all the pixel values between the two-line segments (Fig. 5j), while the lumbar IVD height was obtained by dividing by the lumbar IVD diameter. The IVD height (DH) calculation formula is shown as the following:

$$D{H}^{i}=\frac{1}{\mu ||{D}_{a}^{i}{D}_{p}^{i}|{|}^{i}}{\sum }_{x=\min{X}_{D}}^{\max \,{X}_{D}}{\sum }_{y=\,\min \,{Y}_{D}}^{\max{Y}_{D}}{P}_{xy}$$
(10)

Among them, μ represents the percentage of the center area of the entire lumbar IVD, taken as 80%, \({{||}{D}_{a}^{i}{D}_{p}^{i}{||}}^{i}\)represents the diameter of the ith lumbar IVD, and XD and YD represent the width and height coordinate sets of the four characteristic location points respectively. \(\{{x}_{{D}_{1a}^{i}},{x}_{{D}_{2a}^{i}},{x}_{{D}_{1p}^{i}},{x}_{{D}_{2p}^{i}}\}\)\(\{{y}_{{D}_{1a}^{i}},{y}_{{D}_{2a}^{i}},{y}_{{D}_{1p}^{i}},{y}_{{D}_{2p}^{i}}\}\)

Disc-height index

To reduce individual differences, disc-height index (DHI) was used as normalized geometric parameter. Once the angle of the vertebral body and the midpoint of the endplate marked, the measurement line was drawn according to the marked point15. The DHI calculation formula is shown as the following:

$${{DHI}}^{i}=\frac{2\times {{DH}}^{i}}{{{VH}}^{i}+{{VH}}^{i+1}}$$
(11)

Among them, DHi represents the height of the ith lumbar IVD, and VHi and VHi+1 respectively represent the height of the ith and the (i+1)th vertebral body.

Disc height-to-diameter ratio

Disc height-to-diameter ratio (HDR) is proposed to simultaneously characterize the height and shape of the IVD, which is considered to be the most accurate and repeatable30. In this study, the maximum IVD diameter was obtained by feature-location points, while average IVD height was calculated using the area-based method. Therefore, HDR calculation formula is shown as the following:

$${{HDR}}^{i}=\frac{{{DH}}^{i}}{{{{{{{\rm{||}}}}}}{D}_{{al}}^{i}{D}_{{pr}}^{i}{{{{{\rm{||}}}}}}}^{i}}$$
(12)

where \({{||}{D}_{{al}}^{i}{D}_{{pr}}^{i}{||}}^{i}\) represents the maximum diameter of the ith lumbar intervertebral disc.

IVD-degeneration quantitation

Signal-intensity peak-deviation degree

With IVD-degeneration process, water-content loss can be reflected in signal-intensity changes and height decrease. In this study, the signal-intensity peak-deviation degree from the center (ΔSI) was mainly calculated to describe the water-content loss status with IVD degeneration. Based on IVDs with modified Pfirrmann grade (levels 1, 2, 3, 4, and 5–8), mean and standard deviation of the standard signal-intensity peak difference (ΔSI) of each grade were established as grading standard to quantitatively analyze IVD degeneration (Supplementary Fig. 4). The calculation formula is shown as the following:

$$\triangle =\frac{{{{{{\rm{||}}}}}}\triangle {{{{{{\rm{SI}}}}}}}-{\mu }_{i+1}{{{{{\rm{||}}}}}}}{{\sigma }_{i+1}}-\frac{{{{{{\rm{||}}}}}}\triangle {{{{{{\rm{SI}}}}}}}-{\mu }_{i}{{{{{\rm{||}}}}}}}{{\sigma }_{i}}$$
(13)

where ΔSI is the current peak signal-intensity difference of the IVD, μi and σi are the mean and standard deviation of the standard signal-intensity peak difference of the ith level, and i is 1–4.

Quantitative analysis on IVD degeneration

For the original parameter DH, the ratio of the current DH to the average DH of the corresponding healthy intervertebral disc was used to calculate the collapse percentage. For the nonoriginal parameters ΔSI, DHI, and HDR, which involve the influence of variables such as vertebral body height and intervertebral disc diameter, the same method as J.Jarman et al. is used to calculate the degree of deviation from the range center of the corresponding healthy intervertebral disc-height parameter15:

$${\beta }_{k}=\frac{X-{\mu }_{i}^{j}}{{\sigma }_{i}^{j}}$$
(14)

where βk represents the degree of deviation between the kth nonoriginal parameter and the mean value of the corresponding healthy intervertebral-disc parameter, and k is from 1 to 3. When \({\beta }_{k}\) is smaller, the degree of signal intensity of the intervertebral disc degenerates or the collapse is higher. j represents gender, which is 0 or 1, and i represents structure position.

Evaluation of model performance

Accuracy evaluation on IVD-segmentation performance

To ensure that the model trained with images from one hospital may present equally good accuracy in segmentation for all images from other three hospitals, 20 images randomly selected from each hospital were used to test.

Dice index and Intersection over Union (IOU) were used to measure the similarity between the segmented IVD-related areas and the manual labeled boundaries.

Consistency evaluation on IVD-parameter quantitation

In our study, accuracy in segmentation performance and consistency in IVD quantitative analysis are equally important. To evaluate consistency in IVD quantitative analysis in different resolution, 46 MR images with resolution of 320*320 were randomly selected from Data Set B to be segmented and quantified by model B. Meanwhile, these images were adjusted to 512*512 for segmentation and quantitation by model A. If IVD parameters extracted from model A and model B show good consistency, model A (trained with resolution of 512*512) will be considered applicable enough to extract IVD parameters among a larger population (Data Set C) with different machines.

Although manual measurement may present a greater error and lower consistency than machine measurement, IVD parameters measured by a senior radiologist and orthopedic residents are important as control standard. A 4th-year radiology resident (DW Kong), and a 4th-year orthopedic resident (MC Yin) measured and calculated all the IVD parameters (HDR and DHI) among these 15 MR images randomly selected from Data Set B. Each IVD was measured and recorded three times (Supplementary Fig. 1), from which mean values of three-time measurements were used to compare with each other. In addition, to avoid fatigue in long-term measurement, these residents were asked to have a 20-minute rest after measuring every two MR images.

The intraclass correlation coefficient (ICC) was used to analyze the consistency between the IVD-parameter extraction and IVD manual measurement. Mean time spent on each IVD quantitation was used to describe efficiency.

Validity evaluation on IVD-degeneration quantitation

To test the validity of signal-intensity quantitation on IVD degeneration, 46 MR images randomly selected from Data Set A and Data Set B, respectively, were used to automatically grade IVD-degeneration levels. Meanwhile, a research team, composed of a 4th-year radiology resident (DW Kong), two 8th-year orthopedic resident (J Chen, XF Ma), and three 4th-year orthopedic residents (YL Sun, YP Lin, and MC Yin), graded all the IVD-degeneration levels independently according to the modified Pfirrmann grading system10. They were all blinded to the automatic quantitative measures. Disagreements were resolved by consensus with additional two 10th-year orthopedic residents (XJ Cui and YJ Wang). MacroF1 score was used to analyze the validity between the automatic grade results and final manual grade results.

Baseline characteristic of IVD parameters in a large population

A retrospective study was conducted at four hospitals around China, in which the study population composed of patients who completed lumbar-spine MRI examination between January 1, 2019 and March 30, 2021. Further screening was conducted to exclude IVD herniation, lumbar spondylolisthesis, spine tumors, and severe ossification in vertebral bodies, whose abnormal signal-intensity distribution or irregular boundaries in IVD-related areas may enhance heterogeneity in IVD-degeneration parameters. The screened MR images (Data Set C) were finally used to determine the relationships of baseline variables (age, gender, segments, and degeneration grades) and IVD quantitative parameters (ΔSI, DH, DHI, and HDR).

Statistical analysis

For performance evaluation in IVD segmentation and quantitation, the intraclass correlation coefficient (ICC) was used to analyze the consistency in IVD quantitation. The Dice coefficient and the Intersection over Union (IOU), also known as the Jaccard index, were used to evaluate the segmentation performance of the model. They are given by the following expression:

$${{{{{\rm{mDice}}}}}}=\frac{1}{C}\mathop{\sum }\limits_{i=1}^{C}\frac{2* {{{{{\rm{|}}}}}}{G}_{i}\cap {P}_{i}{{{{{\rm{|}}}}}}}{\left|{G}_{i}\right|+{{{{{\rm{|}}}}}}{P}_{i}{{{{{\rm{|}}}}}}}$$
(15)
$${{{{{\rm{mIOU}}}}}}=\frac{1}{C}\mathop{\sum }\limits_{i=1}^{C}\frac{{{{{{\rm{|}}}}}}{G}_{i}\cap {P}_{i}{{{{{\rm{|}}}}}}}{{{{{{\rm{|}}}}}}{{G}_{i}\cup P}_{i}{{{{{\rm{|}}}}}}}$$
(16)

where Gi is the ground-truth annotation and Pi is segmentation result for the ith segmentation area, and C takes 14, indicating 14 segmented areas.

For IVD characteristic analysis, the mean and standard deviation were calculated for continuous variables and frequency and proportion for categorical variables. The following test was used: T-test, Mann–Whitney U for continuous variables, Chi-square for nominal variables, and Spearman rank correlation. MLR was carried out to determine the relationships of baseline variables (age, gender, segments, and degeneration grades) and IVD quantitative parameters (ΔSI, DH, DHI, and HDR). Spearman rank correlation analysis was used to investigate the correlation between IVD signal intensity and degeneration grades. The macroF1 score and the Kendall correlation coefficient were used to analyze the validity in IVD-degeneration grading performance.

An absolute value of r of 0–0.4 was considered as weak correlation, 0.4–0.6 as moderate correlation, and greater than 0.6 as strong correlation. p-value of <0.05 was considered statistically significant. The calculations were made using IBM SPSS Statistics (version 26, IBM, USA) and Stata (version 15.1, USA).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.