Introduction
Background of the study
Aims and significance
Research hypothesis
Materials and methods
Data description
Variable selection and rationale
Input feature | Type | Missing instances (Hong Kong Cohort only) | Handling technique for missing data |
---|---|---|---|
Age | Continuous | 0 | NA |
Sex | Boolean | 0 | NA |
Tobacco smoking | Boolean | 2 | Binarization of variables during feature engineering |
Alcohol drinking | Categorical (nominal) | 33 | Binarization of variables during feature engineering |
Risk habit indulgence following diagnosis | Categorical (nominal) | 0 | NA |
Previous malignancy | Categorical (nominal) | 0 | NA |
Charlson Comorbidity Index (CCI) | Continuous | 0 | NA |
Hypertension status | Boolean | 0 | NA |
Diabetes Mellitus status | Boolean | 0 | NA |
Hyperlipidemia status | Boolean | 0 | NA |
Autoimmune disease status | Boolean | 0 | NA |
Viral hepatitis status | Boolean | 0 | NA |
Type of lesion | Boolean | 0 | NA |
Clinical subtype of lichenoid lesion | Categorical (nominal) | 0 | NA |
Tongue/FOM involved | Boolean | 0 | NA |
Labial/buccal mucosa involved | Boolean | 0 | NA |
Retromolar area involved | Boolean | 0 | NA |
Gingiva involved | Boolean | 0 | NA |
Palate involved | Boolean | 0 | NA |
Number of lesions | Categorical (ordinal) | 0 | NA |
Presence of ulcers or erosions | Boolean | 0 | NA |
Presence of induration | Boolean | 0 | NA |
Treatment at diagnosis | Categorical (nominal) | 0 | NA |
Recurrence after surgical excision | Boolean | 0 | NA |
Number of recurrences | Categorical (ordinal) | 0 | NA |
Oral epithelial dysplasia at diagnosis | Categorical (nominal) | 0 | NA |
Oral epithelial dysplasia detected during follow-up | Categorical (nominal) | 0 | NA |
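
The table lists "binarization of variables during feature engineering" as the handling technique for the missing smoking and alcohol entries. One possible reading, which is an assumption here and not confirmed by the source, is that each category becomes its own 0/1 indicator and a missing entry simply scores 0 on every indicator. A minimal sketch of that reading follows; the function name `binarize`, the category labels, and the sample records are all hypothetical.

```python
from typing import Optional


def binarize(values: list[Optional[str]], categories: list[str]) -> list[dict[str, int]]:
    """Return one 0/1 indicator per category for every record.

    Assumption (not from the source): a missing value (None) matches no
    category, so it is encoded as all zeros instead of being imputed.
    """
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Hypothetical alcohol-drinking records; None marks a missing entry.
alcohol = ["never", "current", None, "former"]
encoded = binarize(alcohol, ["never", "former", "current"])

for raw, row in zip(alcohol, encoded):
    print(raw, row)
```

Under this reading no separate imputation step is needed, at the cost of treating "missing" the same as "none of the listed categories".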
Model development
Data preprocessing and feature engineering
Machine learning algorithms considered
Model training
Model validation
Performance measures
Explainability and net benefit analyses
Web-based application for future beta testing
Computation
Results
Patients’ description
Variables | First patient group (2003–2019), N = 716, N (%) | Second patient group (2020), N = 58, N (%) |
---|---|---|
Median age at diagnosis (IQR) | 58 (49–67) | 61.5 (53.8–68.3) |
Gender | ||
Female | 401 (56.0) | 33 (56.9) |
Male | 315 (44.0) | 25 (43.1) |
Patient category | ||
NSND | 469 (65.5) | 41 (70.7) |
SD | 247 (34.5) | 17 (29.3) |
Continued risk habits following diagnosis | ||
Yes | 14 (2.0) | 11 (19.0) |
No | 167 (23.3) | 1 (1.7) |
Not applicable | 535 (74.7) | 46 (79.3) |
Previous malignancy | ||
Head and neck tumors | 21 (2.9) | 0 |
Other tumors | 46 (6.4) | 3 (5.2) |
Hematologic malignancies | 23 (3.2) | 6 (10.3) |
No malignancy | 626 (87.4) | 49 (84.5) |
Charlson comorbidity index, mean (SD) | 0.72 (1.01) | 0.64 (1.02) |
Hypertension | 211 (29.5) | 22 (37.9) |
Diabetes mellitus | 111 (15.5) | 9 (15.5) |
Hyperlipidemia | 122 (17.0) | 21 (36.2) |
Autoimmune disease | 42 (5.9) | 3 (5.2) |
Viral hepatitis infection | 69 (9.6) | 3 (5.2) |
Lesion | ||
Oral leukoplakia | 389 (54.3) | 41 (70.7) |
Oral lichen planus/lichenoid lesion | 327 (45.7) | 17 (29.3) |
Clinical subtype of lichenoid lesion | ||
Reticular/Papular | 100 (14.0) | 4 (6.9) |
Erosive/Atrophic | 142 (19.8) | 6 (10.3) |
Plaque | 85 (11.9) | 7 (12.1) |
Tongue/FOM | 245 (34.2) | 25 (43.1) |
Buccal/Labial mucosa | 407 (56.8) | 27 (46.6) |
Retromolar area | 26 (3.6) | 3 (5.2) |
Gingiva | 88 (12.3) | 2 (3.4) |
Palate | 23 (3.2) | 3 (5.2) |
Number of lesions | ||
Single | 469 (65.5) | 44 (75.9) |
Bilateral or double | 210 (29.3) | 10 (17.2) |
Multiple | 37 (5.2) | 4 (6.9) |
Presence of ulcers or erosions | 228 (31.8) | 19 (32.8) |
Induration | 47 (6.6) | 5 (8.6) |
Treatment | ||
Surgical intervention | 221 (30.9) | 20 (34.5) |
Pharmacological treatment | 195 (27.2) | 7 (12.1) |
No treatment | 300 (41.9) | 31 (53.4) |
Post-excision recurrence | 42 (19.0) | 2 (3.4) |
Number of recurrences | ||
1 | 30 (13.5) | 2 (3.4) |
2 | 7 (3.2) | 0 |
3 | 4 (1.8) | 0 |
4 | 1 (0.5) | 0 |
Oral epithelial dysplasia at diagnosis | ||
Absent | 641 (89.5) | 48 (82.8) |
Mild | 34 (4.7) | 6 (10.3) |
Moderate | 27 (3.8) | 0 |
Severe | 7 (1.0) | 0 |
Unknown (defaulted biopsy at diagnosis) | 7 (1.0) | 4 (6.9) |
Oral epithelial dysplasia at follow-up | ||
Absent | 658 (91.9) | 48 (82.8) |
Mild | 11 (1.5) | 0 |
Moderate | 15 (2.1) | 1 (1.7) |
Severe | 24 (3.4) | 7 (12.1) |
Unknown (defaulted biopsy during follow-up) | 8 (1.1) | 2 (3.4) |
Malignant transformation | 76 (10.6) | 6 (10.3) |
AJCC TNM stage | ||
Stage I | 47 (6.6) | 3 (5.2) |
Stage II | 9 (1.3) | 2 (3.4) |
Stage III | 6 (0.8) | 0 |
Stage IV | 12 (1.7) | 0 |
Tumor grade | ||
Well differentiated | 23 (3.2) | NA |
Moderately differentiated | 30 (4.2) | NA |
Poorly differentiated | 3 (0.4) | NA |
Tumor prognosis | ||
Remission | 58 (8.1) | 4 (6.9) |
Recurrence | 6 (0.8) | 2 (3.4) |
Cancer-related death | 6 (0.8) | 0 |
Second primary tumor | 6 (0.8) | 0 |
Predictive performance of classifiers
Logistic regression
Algorithm | SMOTE, training: mean accuracy | SMOTE, training: SD | SMOTE, training: range | SMOTE, testing: accuracy | SMOTE, testing: sensitivity | SMOTE, testing: precision | SMOTE, testing: F1-score | SMOTE, testing: SP | SMOTE, testing: NPV | ADASYN, training: mean accuracy | ADASYN, training: SD | ADASYN, training: range | ADASYN, testing: accuracy | ADASYN, testing: sensitivity | ADASYN, testing: precision | ADASYN, testing: F1-score | ADASYN, testing: SP | ADASYN, testing: NPV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Logistic regression | 0.89 | 0.036 | 0.81–0.95 | 0.88 | 0.75 | 0.39 | 0.51 | 0.89 | 0.98 | 0.88 | 0.043 | 0.81–0.93 | 0.92 | 0.67 | 0.53 | 0.59 | 0.95 | 0.97 | |
Linear SVM | 0.90 | 0.027 | 0.84–0.95 | 0.87 | 0.75 | 0.38 | 0.50 | 0.89 | 0.97 | 0.90 | 0.051 | 0.83–0.98 | 0.95 | 0.67 | 0.73 | 0.70 | 0.98 | 0.97 | |
RBF-Kernel SVM | 0.92 | 0.041 | 0.83–0.98 | 0.92 | 0.50 | 0.55 | 0.52 | 0.96 | 0.95 | 0.93 | 0.027 | 0.88–0.97 | 0.92 | 0.33 | 0.57 | 0.42 | 0.98 | 0.94 | |
Random forest | 0.89 | 0.029 | 0.82–0.92 | 0.87 | 0.67 | 0.35 | 0.46 | 0.89 | 0.97 | 0.90 | 0.033 | 0.83–0.94 | 0.91 | 0.67 | 0.47 | 0.55 | 0.93 | 0.97 | |
Decision tree | 0.81 | 0.038 | 0.72–0.85 | 0.71 | 0.75 | 0.19 | 0.31 | 0.71 | 0.97 | 0.82 | 0.056 | 0.73–0.92 | 0.95 | 0.75 | 0.69 | 0.72 | 0.97 | 0.98 | |
Gradient boosting | 0.91 | 0.030 | 0.83–0.95 | 0.90 | 0.75 | 0.43 | 0.56 | 0.91 | 0.98 | 0.90 | 0.04 | 0.83–0.95 | 0.95 | 0.67 | 0.73 | 0.70 | 0.98 | 0.97 | |
kNN | 0.89 | 0.025 | 0.85–0.94 | 0.87 | 0.42 | 0.29 | 0.35 | 0.91 | 0.94 | 0.90 | 0.032 | 0.83–0.94 | 0.83 | 0.42 | 0.23 | 0.29 | 0.87 | 0.94 | |
MLP-BP | 0.82 | 0.039 | 0.75–0.89 | 0.94 | 0.75 | 0.60 | 0.67 | 0.95 | 0.98 | 0.85 | 0.066 | 0.71–0.96 | 0.76 | 0.21 | 0.21 | 0.32 | 0.77 | 0.96 | |
LDA | 0.89 | 0.034 | 0.82–0.94 | 0.87 | 0.67 | 0.35 | 0.46 | 0.89 | 0.97 | 0.89 | 0.049 | 0.79–0.95 | 0.93 | 0.58 | 0.58 | 0.58 | 0.96 | 0.96 |
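
The testing columns above (accuracy, sensitivity, precision, F1-score, SP, NPV) all derive from the four cells of a binary confusion matrix. As a reading aid, the sketch below computes them using the standard definitions, taking SP to mean specificity. The function name `classification_metrics` and the counts are hypothetical and are not taken from the study data; the sketch also assumes every denominator is non-zero.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard binary classification metrics from confusion-matrix counts.

    Assumes all denominators are non-zero (at least one actual positive,
    one actual negative, one predicted positive, one predicted negative).
    """
    sensitivity = tp / (tp + fn)   # recall: share of actual positives detected
    precision = tp / (tp + fp)     # positive predictive value
    specificity = tn / (tn + fp)   # share of actual negatives correctly ruled out
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "precision": precision,
        "f1": f1,
        "specificity": specificity,
        "npv": npv,
    }


# Hypothetical counts for an imbalanced test set (illustrative only).
metrics = classification_metrics(tp=9, fp=6, tn=120, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With a rare positive class, accuracy and NPV stay high even when precision is modest, which is why the tables report sensitivity, precision, and F1-score alongside accuracy.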
Algorithm | SMOTE, training: mean accuracy | SMOTE, training: SD | SMOTE, training: range | SMOTE, testing: accuracy | SMOTE, testing: sensitivity | SMOTE, testing: precision | SMOTE, testing: F1-score | SMOTE, testing: SP | SMOTE, testing: NPV | ADASYN, training: mean accuracy | ADASYN, training: SD | ADASYN, training: range | ADASYN, testing: accuracy | ADASYN, testing: sensitivity | ADASYN, testing: precision | ADASYN, testing: F1-score | ADASYN, testing: SP | ADASYN, testing: NPV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Logistic regression | 0.89 | 0.033 | 0.84–0.96 | 0.84 | 0.67 | 0.30 | 0.41 | 0.85 | 0.97 | 0.88 | 0.047 | 0.82–0.97 | 0.91 | 0.67 | 0.47 | 0.55 | 0.93 | 0.97 | |
Linear SVM | 0.89 | 0.036 | 0.85–0.96 | 0.82 | 0.83 | 0.29 | 0.44 | 0.82 | 0.98 | 0.88 | 0.04 | 0.81–0.95 | 0.94 | 0.67 | 0.62 | 0.64 | 0.96 | 0.97 | |
RBF-Kernel SVM | 0.91 | 0.039 | 0.83–0.97 | 0.87 | 0.50 | 0.33 | 0.40 | 0.91 | 0.95 | 0.91 | 0.025 | 0.87–0.95 | 0.90 | 0.42 | 0.42 | 0.42 | 0.95 | 0.95 | |
Random forest | 0.83 | 0.034 | 0.77–0.88 | 0.96 | 0.58 | 0.88 | 0.70 | 0.99 | 0.96 | 0.81 | 0.055 | 0.75–0.91 | 0.88 | 0.67 | 0.38 | 0.49 | 0.90 | 0.97 | |
Decision tree | 0.91 | 0.045 | 0.83–0.98 | 0.91 | 0.50 | 0.46 | 0.48 | 0.95 | 0.95 | 0.92 | 0.032 | 0.87–0.97 | 0.90 | 0.67 | 0.42 | 0.52 | 0.92 | 0.97 | |
Gradient boosting | 0.90 | 0.040 | 0.85–0.97 | 0.86 | 0.83 | 0.36 | 0.50 | 0.86 | 0.98 | 0.87 | 0.035 | 0.82–0.94 | 0.94 | 0.75 | 0.64 | 0.69 | 0.96 | 0.98 | |
kNN | 0.90 | 0.035 | 0.82–0.96 | 0.87 | 0.42 | 0.29 | 0.35 | 0.91 | 0.94 | 0.92 | 0.032 | 0.85–0.98 | 0.89 | 0.33 | 0.33 | 0.33 | 0.94 | 0.94 | |
MLP-BP | 0.86 | 0.038 | 0.80–0.91 | 0.90 | 0.75 | 0.43 | 0.55 | 0.91 | 0.98 | 0.83 | 0.067 | 0.68–0.89 | 0.90 | 0.75 | 0.42 | 0.55 | 0.91 | 0.98 | |
LDA | 0.85 | 0.036 | 0.80–0.91 | 0.83 | 0.67 | 0.28 | 0.39 | 0.84 | 0.96 | 0.87 | 0.066 | 0.75–0.96 | 0.89 | 0.58 | 0.39 | 0.47 | 0.92 | 0.96 |
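Algorithm | 26 features, training: mean accuracy | 26 features, training: SD | 26 features, training: range | 26 features, testing: accuracy | 26 features, testing: sensitivity | 26 features, testing: precision | 26 features, testing: F1-score | 26 features, testing: SP | 26 features, testing: NPV | 15 features, training: mean accuracy | 15 features, training: SD | 15 features, training: range | 15 features, testing: accuracy | 15 features, testing: sensitivity | 15 features, testing: precision | 15 features, testing: F1-score | 15 features, testing: SP | 15 features, testing: NPV | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|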
Logistic regression | 0.89 | 0.044 | 0.83–0.95 | 0.92 | 0.75 | 0.50 | 0.60 | 0.93 | 0.98 | 0.91 | 0.035 | 0.84–0.97 | 0.92 | 0.75 | 0.53 | 0.62 | 0.94 | 0.98 | |
Linear SVM | 0.89 | 0.040 | 0.81–0.95 | 0.93 | 0.75 | 0.56 | 0.64 | 0.95 | 0.98 | 0.91 | 0.032 | 0.86–0.97 | 0.94 | 0.67 | 0.67 | 0.67 | 0.97 | 0.97 | |
RBF-Kernel SVM | 0.90 | 0.028 | 0.85–0.95 | 0.92 | 0.33 | 0.50 | 0.40 | 0.97 | 0.94 | 0.90 | 0.030 | 0.86–0.97 | 0.94 | 0.33 | 0.80 | 0.47 | 0.99 | 0.94 | |
Random forest | 0.92 | 0.032 | 0.86–0.97 | 0.97 | 0.75 | 0.90 | 0.81 | 0.99 | 0.98 | 0.92 | 0.032 | 0.86–0.97 | 0.94 | 0.42 | 0.83 | 0.56 | 0.99 | 0.95 | |
Decision tree | 0.91 | 0.040 | 0.83–0.97 | 0.95 | 0.75 | 0.69 | 0.72 | 0.97 | 0.98 | 0.88 | 0.038 | 0.83–0.97 | 0.92 | 0.42 | 0.56 | 0.48 | 0.97 | 0.95 |