Introduction

Solitary pulmonary nodules (SPNs) are common clinical findings, often incidental, that may represent malignant disease in the lung. An SPN is defined as a single, well-defined pulmonary nodule less than 3 cm in diameter, surrounded by normal lung tissue and not associated with atelectasis or adenopathy1. Causes of SPNs range from lung cancer and metastatic cancer to infections, scar formation, and other benign lesions. Previous studies showed that pulmonary nodules were detected in 69% of patients who underwent lung cancer screening with low-dose CT2, and that 53% of SPNs were malignant3. Lung cancer usually presents as an SPN on diagnostic imaging at early stages of the disease1. Accurate classification of SPNs is clinically important, as diagnosing malignant disease at this early timepoint is critical for improving the success of surgical resection and increasing 5-year survival rates.

18F-fluorodeoxyglucose (18F-FDG) PET has demonstrated utility for SPN diagnosis, with high sensitivity for detecting malignant SPNs; however, the application of FDG PET/CT is limited by its variable specificity estimates4. 18F-FDG, a PET tracer of glucose metabolism, shows significantly different uptake in malignant and benign lesions. The standardized uptake value (SUV), a measure of tracer uptake inside the nodule, has demonstrated good quantitative repeatability in lung nodules. An SUVmax above 2.5 has been reported as a diagnostic threshold for malignant SPNs5; however, FDG PET/CT is less specific in diagnosing malignancy in populations with endemic infectious lung disease than in non-endemic regions. To improve the specificity of FDG PET/CT, some authors have proposed dual time point imaging (DTPI), using the retention index (RI) to help differentiate benign from malignant SPNs. Unfortunately, DTPI studies have shown varied sensitivity and specificity relative to single time point imaging, with both positive6, 7 and negative findings8, 9 being reported. There is an urgent need to improve the accuracy and specificity of diagnosing malignancy in populations from granuloma-endemic regions.

Beyond evaluating lesion SUVmax, the spatial distribution of FDG also contains important information. Uptake of FDG is not homogeneously distributed within the lesions. Many factors, such as cellular proliferation, necrosis, blood flow and hypoxia, may contribute to intra-lesion heterogeneity10. Thus, measurements of this heterogeneity might help to distinguish benign from malignant pulmonary nodules.

Characterization of uptake heterogeneity is gaining popularity through radiomics-based analysis, which extracts high-throughput features based on the intensity, shape, and texture of uptake within regions of interest. In CT imaging, the use of quantitative heterogeneity metrics for diagnostic purposes has been well studied, showing promising results in various cancer types11, 12. In FDG PET/CT, the use of texture features has improved discrimination between abnormal and normal tissue for lesion delineation. Texture features derived from the Neighboring Gray Tone Difference Matrix13, describing properties such as coarseness, contrast and busyness on PET images, have shown the ability to differentiate tumor from normal tissue in head and neck cancer14. However, only a few studies have looked at the diagnostic value of quantitative heterogeneity features in FDG PET/CT imaging15, 16. Whether texture features can discriminate malignant from benign lesions is still unknown. Furthermore, no studies have evaluated the use of quantitative heterogeneity in DTPI PET/CT images for SPN differentiation.

The main goal of this study was to assess the use of quantitative heterogeneity features extracted from DTPI images for differentiating malignant from benign SPN lesions in a population from granuloma-endemic regions. Machine learning models using texture features from DTPI PET/CT were tested and compared against commonly used clinical metrics and visual interpretation.

Results

Patient population

In total, 149 SPN patients who underwent DTPI FDG PET/CT scans between 2004 and 2014 were reviewed. Sixty-four patients were excluded from the study because the metabolic volume of the SPN was smaller than 5 mL, making texture feature analysis unreliable. As a result, 85 patients (mean age: 61.58 ± 11.95 years, 56 male) were included in this study. Lesion diagnoses were confirmed by pathology (n = 74 lesions) or follow-up imaging (median: 14 months, range: 12–34 months) (n = 11 lesions). Sixty-three lesions were malignant nodules, with 61 confirmed by pathology results and 2 showing evidence of metastasis during the follow-up period and subsequently clinically diagnosed as lung cancer. Twenty-two lesions were diagnosed as benign: 13 were confirmed by pathology results, 4 were diagnosed as stable nodules, and 5 were diagnosed as nodules that had shrunk after treatment with antibiotics. Table 1 summarizes the diagnoses of SPNs.

Table 1 Diagnosis of SPNs.

Comparison of machine learning models and clinical metrics

The discriminating power of each model was compared using the area under the curve (AUC) of the receiver operating characteristic (ROC) curve (Fig. 1). The ROC analysis showed that the AUCs of the standard clinical metrics, namely early SUVmax, delayed SUVmax, RI, and visual interpretation, were 0.77, 0.77, 0.56 and 0.76, respectively (Table 2). Of these clinical metrics, RI performed the worst, with a significantly lower AUC than both early SUVmax and visual interpretation (p = 0.01 and p = 0.02, respectively). Support Vector Machine (SVM) models of quantitative features showed AUCs of 0.90 for the dPET (delayed PET) model and 0.91 for the edPET/CT (early and delayed PET/CT) model, which were significantly larger than the AUC of either early SUVmax (p = 0.02, p = 0.01) or visual interpretation (p = 0.03, p = 0.04). Compared with either early SUVmax or visual interpretation, the ePET (early PET) model and the ePET/CT (early PET/CT) model showed moderate improvements, and the eCT (early CT) model showed a slight decline, but the differences were not significant (Table 2).

Figure 1

ROC curves of SVM models, early SUVmax, visual interpretation and retention index. The areas under the curves show the ability of the machine learning models, early SUVmax, visual interpretation and retention index to distinguish malignant from benign SPNs. The dPET model and edPET/CT model had significantly greater discriminating power than early SUVmax, visual interpretation and retention index.

Table 2 AUC of ROC Analysis for each model.

The performance of each model and clinical index at its optimal threshold was calculated and compared (Table 3). At the optimal threshold, the dPET and edPET/CT models achieved the best specificity, accuracy and positive predictive value. Two of the clinical indices, early SUVmax and RI, had a higher sensitivity than all SVM models, but their specificity was low.

Table 3 Diagnostic values for differentiation of malignant and benign SPN lesions with SVM models and indexes.

Table 4 summarizes the features selected by the models more than twice, with their selection frequencies. Delayed Busyness had the highest frequency of selection, being chosen in all 5/5 cross-validation folds of the combined edPET/CT model and in 3/5 folds of the dPET model. The most commonly selected feature types were “Busyness”, “Coarseness” and “Cluster Prominence” on either delayed or early PET.

Table 4 Frequency of most commonly selected features for each model.

Discussion

In this study, we were able to demonstrate significant improvement in distinguishing malignant from benign SPNs using machine learning techniques. The machine learning algorithms were trained on quantitative heterogeneity features from DTPI PET and CT images, and the results were shown to be superior to commonly used clinical metrics and visual interpretation. Using an SVM model created with DTPI FDG PET/CT features, we were able to achieve higher diagnostic accuracy, marked by higher specificity and sensitivity, than commonly used clinical metrics and visual interpretation. This is the first report of using texture features for diagnosis of solitary pulmonary nodules, and the first study to evaluate texture features in delayed PET images. The improvement in diagnostic performance shown in this study could potentially benefit patients by preventing unnecessary invasive tests following false-positive findings or by providing earlier detection and intervention in patients with malignant disease.

We compared the diagnostic accuracy of SVM models with that of two experienced physicians. The diagnostic ability of the SVM model built from multiple early FDG PET/CT texture features was comparable to that of the two experienced physicians. With the addition of delayed PET features, the SVM models showed greater diagnostic accuracy than physician assessment and common clinical metrics. Therefore, SVM analysis of DTPI FDG PET/CT might play an integral role as a semi-automated tool to supplement diagnostic classifications for physician readers.

Previous studies have reported success using features from high-quality diagnostic CT in the classification of lung cancer12. Results from our study suggest that CT texture features from the free-breathing CT images of FDG PET/CT may not provide more information than PET features alone for classification purposes. Models using only CT features can be used in SPN classification, but their differentiation ability is weak, and no improvements were seen compared with using only early SUVmax. Additionally, the AUC and accuracy of the ePET model and the ePET/CT model are similar, suggesting that the addition of early CT features to early PET features did not significantly improve diagnostic ability.

FDG PET/CT has been demonstrated as a useful, yet incomplete, tool for SPN diagnosis. One of its main limitations is that many benign lesions demonstrate high FDG uptake, which leads to false-positive results. Active infectious and non-infectious inflammatory etiologies can lead to high uptake on FDG PET. There are numerous reports in the literature of false-positive findings on FDG PET imaging due to granulomas and tuberculosis17. FDG PET/CT was less specific in diagnosing malignant lesions in populations with endemic infectious lung disease than in non-endemic regions4; the average adjusted specificity in regions with endemic infectious lung disease was 16% lower4. Improving the accuracy and specificity of FDG PET/CT in diagnosing malignancy is a difficult problem, especially in populations from granuloma-endemic regions, which would be expected to have a higher rate of false-positive FDG PET/CT findings due to uptake in benign inflammatory granulomas. In this study, both visual interpretation and early SUVmax had low specificity and accuracy in diagnosing malignant SPNs, but SVMs using texture features extracted from both timepoints of DTPI FDG PET/CT images achieved higher specificity and accuracy than clinical metrics and visual interpretation. This supports the hypothesis that intratumoral heterogeneity of FDG uptake is useful for discriminating benign from malignant nodules in patients from granuloma-endemic regions.

Materials and Methods

This retrospective study was approved by the Medical Science Ethic Committee of the 1st hospital of China Medical University, and formal consent was not required. All procedures performed in studies involving human participants were in accordance with the ethical standards of the institution. This study was retrospective, and its results did not influence further therapeutic decision-making.

Patients

DTPI 18F-FDG PET/CT scans were performed between 2004 and 2014 at a single center in a granuloma-endemic region. The scans were retrospectively reviewed, and those showing a solitary pulmonary nodule (SPN) were analyzed. Diagnosis of malignant or benign disease was primarily established by pathology review following biopsy or surgical resection of the nodule within one month after the FDG PET/CT acquisition. When biopsy or surgery was not performed, patients were followed for at least 1 year after the FDG PET/CT scan, using chest radiography or CT. If the SPN was stable during this period or shrank with antibiotic treatment, it was considered benign. Patients without pathological confirmation and with less than 1 year of surveillance follow-up imaging were removed from this study. In total, FDG PET/CT data from 149 patients were reviewed under these criteria.

Image acquisition and reconstruction

Patients fasted for at least 6 h, and blood glucose levels were measured before injection of 5.55 MBq/kg (0.15 mCi/kg) 18F-FDG. Early and delayed FDG PET/CT acquisitions started 60 min and 180 min post injection, respectively, using a GE Discovery LS 4 PET/CT scanner. Emission data were acquired for 3 min per bed position in 2D mode. The PET images were reconstructed using an iterative algorithm (ordered-subsets expectation maximization: 2 iterations, 28 subsets) with an 8-mm Gaussian filter, a 128 × 128 matrix and 4.25 mm/slice. Transmission scanning proceeded under the following parameters.

Image Analysis

Nodules on both early FDG PET (ePET) and delayed FDG PET (dPET) images were identified and segmented by two experienced physicians (each with more than 5 years of experience in nuclear medicine). Discordant segmentations were resolved by discussion and mutual consensus. Lesions with a volume of interest (VOI) smaller than 5 mL on the early FDG PET/CT images were removed from this study, following recommendations from previous studies showing that FDG PET texture features are uncertain in small volumes18, 19.

For each lesion, two quantitative clinical metrics, SUVmax and retention index (RI), were calculated.

RI was calculated according to equation (1).

$$\mathrm{RI} = 100\% \times \frac{\mathrm{delayed}\ \mathrm{SUV}_{\max} - \mathrm{early}\ \mathrm{SUV}_{\max}}{\mathrm{early}\ \mathrm{SUV}_{\max}} \tag{1}$$
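As a worked illustration of equation (1), the retention index can be computed as in the minimal Python sketch below. The function name and the example SUVmax values are hypothetical and are not taken from the study data.

```python
def retention_index(early_suvmax: float, delayed_suvmax: float) -> float:
    """Percent change in SUVmax from the early to the delayed scan (equation 1)."""
    if early_suvmax <= 0:
        raise ValueError("early SUVmax must be positive")
    return 100.0 * (delayed_suvmax - early_suvmax) / early_suvmax


# Hypothetical lesion whose SUVmax rises from 4.0 (early) to 5.0 (delayed).
print(f"RI = {retention_index(4.0, 5.0):.1f}%")  # prints "RI = 25.0%"
```

A positive RI indicates that uptake increased between the two time points, and a negative RI indicates washout.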

Additionally, for each lesion in this study, a visual interpretation score was assigned based on the DTPI FDG PET/CT images (both early and delayed FDG PET/CT images). All images were interpreted by two physicians. Image interpretation was performed without knowledge of patient diagnosis to avoid potential interpretation bias. Readers used a 5-point scale representing the likelihood of the lesion being benign or malignant: 1, definitely benign; 2, probably benign; 3, equivocal; 4, probably malignant; and 5, definitely malignant. If the interpretation scores of the two readers were discordant, they were subsequently discussed to arrive at a consensus.

The following interpretive criteria20 were utilized:

CT image interpretation

Each nodule was characterized in terms of its attenuation, shape, and margin characteristics using previously well-described criteria developed by the American College of Radiology Imaging Network (ACRIN) for its lung cancer screening trials21, 22.

FDG PET image interpretation

FDG PET image interpretation was based on the degree and distribution of uptake in the lesion on the early FDG PET images, as well as the contrast changes apparent on the delayed FDG PET images. If lesion uptake was higher than blood pool on the early FDG PET, appeared distinctly focal, and increased in contrast on the delayed FDG PET, the lesion was categorized as more likely to be malignant. In contrast, if the lesion was poorly defined with low uptake on the early FDG PET and lost contrast on the delayed FDG PET, it was interpreted as more likely to be benign. By integrating FDG PET characteristics (degree of uptake, distribution of uptake, spatial volume effect, and contrast changes) and CT characteristics (attenuation, shape, and margin characteristics), the readers then assigned a score on the 5-point scale.

Quantitative Texture-based Analysis

Texture features were extracted following a voxel-based methodology described previously23, 24. Uptake values within the volume of interest (VOI) were resampled using a 256-bin discretization prior to texture feature extraction. For each voxel inside the VOI, a patch was extracted, defined as a 5 × 5 × 5 voxel (axial, coronal, sagittal) portion of the image centered on that voxel. Texture features were computed on these patches in 4 angular directions on axial, coronal and sagittal slices, and the mean value was then calculated. For each lesion, 59 features were extracted from each of the early FDG PET (ePET), delayed FDG PET (dPET) and CT images with in-house code developed in MATLAB, using the methodology described by Galavis et al.24, adapted to include all nearest neighbors of each voxel. This 3D extraction approach was motivated by previous work showing that increasing the number of neighboring planes did not improve the observed performance of the extracted spatial information25, 26. The features included 10 histogram-based first-order features, 8 first-order features, 22 features based on the co-occurrence matrix, 11 features based on the gray level run length matrix, 5 features based on the neighboring gray level and 3 features based on the neighborhood gray tone difference matrix (Table 5).

Table 5 Texture features.

In total, 177 features were calculated for each lesion: 59 features from ePET images, 59 features from dPET images and 59 features from CT images.
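To make the discretization and direction-averaging steps more concrete, the Python sketch below computes one co-occurrence statistic (contrast) on a single 2D patch, averaged over 4 in-plane directions. It is a simplified illustration and not the in-house MATLAB code: the equal-width binning between the patch minimum and maximum, the 2D rather than 5 × 5 × 5 patch, the choice of contrast as the example statistic, and all function names are assumptions.

```python
from collections import Counter

# The four in-plane directions (0, 45, 90 and 135 degrees) as (dx, dy) offsets.
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1)]


def discretize(patch, n_bins=256):
    """Map values to gray levels 0..n_bins-1 with equal-width bins (assumed scheme)."""
    flat = [v for row in patch for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0] * len(row) for row in patch]
    scale = n_bins / (hi - lo)
    return [[min(int((v - lo) * scale), n_bins - 1) for v in row] for row in patch]


def cooccurrence(levels, dx, dy):
    """Symmetric, normalized co-occurrence probabilities for one offset."""
    counts = Counter()
    rows, cols = len(levels), len(levels[0])
    for y in range(rows):
        for x in range(cols):
            ny, nx = y + dy, x + dx
            if 0 <= ny < rows and 0 <= nx < cols:
                a, b = levels[y][x], levels[ny][nx]
                counts[(a, b)] += 1
                counts[(b, a)] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def contrast(probabilities):
    """Co-occurrence contrast: sum of p(i, j) * (i - j)^2."""
    return sum(p * (i - j) ** 2 for (i, j), p in probabilities.items())


def mean_contrast(patch, n_bins=256):
    """Contrast averaged over the four in-plane directions."""
    levels = discretize(patch, n_bins)
    values = [contrast(cooccurrence(levels, dx, dy)) for dx, dy in DIRECTIONS]
    return sum(values) / len(values)


# Hypothetical 2 x 2 checkerboard patch, discretized to 2 gray levels.
print(mean_contrast([[0.0, 1.0], [1.0, 0.0]], n_bins=2))
```

A uniform patch gives a contrast of zero, while patches with abrupt gray-level changes between neighbors give larger values.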

Features selection and testing machine learning models

Five-fold cross validation was used in this study to divide the original data into training data and validation data. By using 5-fold cross-validation27, the original data is randomly partitioned into 5 equally sized subsamples. Of the 5 subsamples, a single subsample is retained as the validation data for testing the model, and the remaining 4 subsamples are used as training data. The cross-validation process is then repeated 5 times, with each of the 5 subsamples used exactly once as the validation data. The 5 results from the folds can then be averaged to produce a single estimation.
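The partitioning described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name, the fixed random seed and the use of lesion indices rather than feature vectors are assumptions, not details of the study.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, validation) index lists; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random partition
    folds = [indices[i::k] for i in range(k)]     # k near-equal subsamples
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# With 85 lesions, each fold holds out 17 lesions and trains on the other 68.
for train, validation in k_fold_indices(85):
    print(len(train), len(validation))
```

Feature selection and model fitting would then be run on each training split, with the held-out split used only for evaluation.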

To reduce the number of features, sequential forward floating selection (SFFS) was used to select only a few critical features for training the Support Vector Machine (SVM) models28. SFFS was performed on the training dataset, and the maximal number of selected features was set to five. Five SVM models were built with selected features from different feature sets: (1) ePET model: early PET features, (2) eCT model: early CT features, (3) dPET model: delayed PET features, (4) ePET/CT model: early PET/CT features, (5) edPET/CT model: early PET/CT features and delayed PET features. The SVM models were built with built-in MATLAB (ver. 2015b) functions using the default parameters.
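The general SFFS procedure can be sketched in Python as follows. This is a generic illustration of the algorithm, not the MATLAB implementation used in the study. The selection criterion is not stated in the text, so the `score_fn` argument is a placeholder (in practice it might be a classifier's accuracy on the training data), and the toy scoring function below is purely hypothetical.

```python
def sffs(n_features, score_fn, max_features=5):
    """Sequential forward floating selection.

    score_fn takes a tuple of feature indices and returns a value to maximize.
    Returns (best_subset, best_score) over all subset sizes visited.
    """
    if n_features < 1:
        raise ValueError("need at least one feature")
    best = {}      # subset size -> (score, subset) for the best subset seen
    current = ()
    while len(current) < max_features:
        remaining = [f for f in range(n_features) if f not in current]
        if not remaining:
            break
        # Forward step: add the single feature that gives the highest score.
        score, current = max((score_fn(current + (f,)), current + (f,))
                             for f in remaining)
        if len(current) not in best or score > best[len(current)][0]:
            best[len(current)] = (score, current)
        # Floating step: drop a feature while doing so beats the best
        # subset of that smaller size found so far.
        while len(current) > 2:
            reduced_sets = [tuple(g for g in current if g != f) for f in current]
            score, reduced = max((score_fn(s), s) for s in reduced_sets)
            if score > best[len(reduced)][0]:
                current = reduced
                best[len(reduced)] = (score, reduced)
            else:
                break
    best_score, best_subset = max(best.values(),
                                  key=lambda item: (item[0], -len(item[1])))
    return best_subset, best_score


# Hypothetical criterion: features 1, 4 and 7 are informative, and every
# extra feature carries a small penalty.
USEFUL = {1: 0.30, 4: 0.25, 7: 0.20}


def toy_score(subset):
    return sum(USEFUL.get(f, 0.0) for f in subset) - 0.01 * len(subset)


subset, score = sffs(n_features=10, score_fn=toy_score, max_features=5)
print(subset)  # prints (1, 4, 7)
```

The floating step is what distinguishes SFFS from plain forward selection: a feature added early can be discarded later if a different combination of the same size scores higher.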

The ability of each machine learning model to classify unseen SPN lesions as benign or malignant was tested using receiver operating characteristic (ROC) analysis.

Comparison of model performance with clinical metrics and visual interpretation

The performance of each model (SVM models, clinical metrics and visual interpretation) was evaluated using the area under the ROC curve (AUC), diagnostic accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. AUCs were compared using DeLong’s test29. The point on the ROC curve nearest the upper left corner of the plot was considered to be the optimal threshold for diagnosis. The diagnostic accuracy, sensitivity and specificity were calculated using both the optimal threshold and commonly accepted clinical thresholds (early SUVmax > 2.5 and RI > 10% for malignant lesions).
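The AUC and the "nearest the upper left corner" rule can be illustrated with the short Python sketch below. The scores and labels are made-up values rather than study data, the function names are hypothetical, and DeLong's test for comparing AUCs is not implemented here.

```python
def auc(scores, labels):
    """AUC as the probability that a malignant lesion (label 1) scores higher
    than a benign lesion (label 0), counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one malignant and one benign lesion")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_points(scores, labels):
    """(threshold, sensitivity, specificity) for each distinct score, calling a
    lesion malignant when its score is at or above the threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        points.append((t, tp / n_pos, tn / n_neg))
    return points


def optimal_threshold(scores, labels):
    """ROC point closest to the upper left corner (sensitivity = specificity = 1)."""
    return min(roc_points(scores, labels),
               key=lambda p: (1 - p[1]) ** 2 + (1 - p[2]) ** 2)


# Hypothetical scores (for example, SUVmax) for 3 benign and 5 malignant lesions.
scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
labels = [0, 0, 1, 0, 1, 1, 1, 1]

print(round(auc(scores, labels), 3))
print(optimal_threshold(scores, labels))
```

In this toy example the selected threshold is 5.0, where sensitivity is 0.8 and specificity is 1.0; a fixed clinical cut-off such as SUVmax > 2.5 would be evaluated with the same sensitivity and specificity calculation at that single threshold.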

Conclusions

The intratumoral heterogeneity of FDG uptake was useful for discriminating benign from malignant nodules in larger SPNs, especially on delayed PET images, in patients from granuloma-endemic regions. Texture features on FDG DTPI provided different types of information that should be used to supplement SUVmax in making a diagnosis. SVMs using texture features extracted from DTPI FDG PET/CT images showed a significant improvement in discriminating benign from malignant nodules over commonly used clinical metrics and visual interpretation.