Introduction
- Preprocessing the data, including a normalization step and the filling of missing values with several algorithms: the mean value, KNN, MICE, and RF.
- Addressing the class-imbalance challenge of the Framingham dataset with the SMOTE approach.
- Classifying patients with cardiovascular disease using various machine learning techniques: Nu SVM, the Gradient Boosting regressor, Extreme Gradient Boosting (XGBoost), AdaBoost, Extra Trees, LightGBM, SGD, and a stacking algorithm.
- Comparing the machine learning methods, such as Nu SVM, the Gradient Boosting regressor, XGBoost, and the stacking algorithm, using various metrics and the Receiver Operating Characteristic (ROC) curve.
Related works
Methodology
Datasets description
UCI heart disease dataset
Variable | Description |
---|---|
Age | Age in years (29 to 77) |
Sex | Gender instance (0 = Female, 1 = Male) |
ChestPainType | Chest pain type (1: typical angina, 2: atypical angina, 3: non- anginal pain, 4: asymptomatic) |
RestBloodPressure | Resting blood pressure in mm Hg [94, 200] |
Cholesterol | Serum cholesterol in mg/dl [126, 564] |
FastingBloodSugar | Fasting blood sugar > 120 mg/dl (0 = False, 1 = True) |
ResElectrocardiograp | Resting ECG results (0: normal, 1: ST-T wave abnormality, 2: LV hypertrophy) |
MaxHeartRate | Maximum heart rate achieved [71, 202] |
ExerciseInduced | Exercise-induced angina (0: No, 1: Yes) |
Oldpeak | ST depression induced by exercise relative to rest [0.0, 6.2] |
Slope | Slope of the peak exercise ST segment (1: up-sloping, 2: flat, 3: downsloping) |
MajorVessels | Number of major vessels colored by fluoroscopy (values 0 - 3) |
Thal | Defect types: value 3: normal, 6: fixed defect, 7: irreversible defect |
HeartDisease | Target: value 0: absence of disease; 1, 2, 3, or 4: presence of cardiovascular disease |
Framingham dataset
Variable | Description |
---|---|
Age | Age in years (32 to 70) |
Male | Gender instance (0 = Female, 1 = Male) |
Education | Level of education (1 to 4) |
CurrentSmoker | Whether or not the patient is a current smoker (0: no, 1: yes) |
CigsPerDay | The number of cigarettes the person smoked on average per day |
BPMeds | Whether or not the patient was on blood pressure medication (0: no, 1: yes) |
PrevalentStroke | Whether or not the patient previously had a stroke (0: no, 1: yes) |
PrevalentHyp | Whether or not the patient was hypertensive (0: no, 1: yes) |
Diabetes | Whether or not the patient had diabetes (0: no, 1: yes) |
TotChol | Total cholesterol level |
SysBP | Systolic blood pressure |
DiaBP | Diastolic blood pressure |
BMI | Body Mass Index |
Heart Rate | Measure of heart rate |
Glucose | Glucose level |
TenyearHeart | Whether or not the patient will develop heart disease within the next ten years (target) (0: no, 1: yes) |
Data processing
Label conversion
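The label-conversion step can be sketched as follows, assuming a pandas DataFrame with the UCI target column `HeartDisease` (values 0-4), which is binarized so that any non-zero value indicates the presence of disease. The column name follows the dataset table; the tiny frame below is illustrative.

```python
# Hypothetical sketch of label conversion: the UCI target takes values 0-4,
# where any non-zero value indicates disease, so it is mapped to {0, 1}.
import pandas as pd

df = pd.DataFrame({"HeartDisease": [0, 1, 2, 0, 3, 4]})
df["HeartDisease"] = (df["HeartDisease"] > 0).astype(int)
print(df["HeartDisease"].tolist())  # [0, 1, 1, 0, 1, 1]
```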
Data normalization
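A minimal sketch of the normalization step, assuming min-max scaling to [0, 1] (the exact scheme is not specified here); the small matrix uses illustrative blood-pressure and cholesterol values:

```python
# Min-max normalization sketch: scale each column to the [0, 1] range.
import numpy as np

def min_max_normalize(X):
    """Scale each column of X to [0, 1]."""
    X = np.asarray(X, dtype=float)
    mins = X.min(axis=0)
    ranges = X.max(axis=0) - mins
    ranges[ranges == 0] = 1.0  # avoid division by zero on constant columns
    return (X - mins) / ranges

X = np.array([[94.0, 126.0], [200.0, 564.0], [120.0, 240.0]])
print(min_max_normalize(X))
```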
Filling missing data
- Mean value: this algorithm replaces each missing entry with the mean of its column. The technique is frequently used because it is easy to implement [28]; for feature $j$ with $n$ observed values $x_{ij}$, the imputed value is$$\begin{aligned} \begin{array}{rcl} \bar{x}_{j}= \frac{\sum _{i=1}^{n}x_{ij}}{n} \end{array} \end{aligned}$$(2)
- K Nearest Neighbor (KNN): an algorithm that matches a point in a multi-dimensional space with its k nearest neighbors. It can be used for continuous, discrete, ordinal, and categorical data, which makes it particularly useful for handling missing data [29]. For example, suppose RestBloodPressure is missing among the 13 variables: for that missing value, we look at the person's other characteristics, find the k nearest neighbors, and approximate the person's RestBloodPressure from them. To select the best value of k for the KNN algorithm, we trained two classifiers, SVM and BN, for various values of k; the best accuracy is achieved at k = 6 for both classifiers. The results are given in Fig. 2 below.
- Random Forest (RF): increasingly popular for dealing with missing data, particularly in biomedical research. Unlike traditional imputation methods, RF does not assume normality or require the specification of parametric models [30]. For instance, the Chol feature lacks 23 values. RF first imputes missing data with the mean/mode, then, for each variable with missing values, fits a random forest on the observed portion and predicts the missing portion. This training and prediction process is repeated iteratively until the desired accuracy is attained or a user-specified maximum number of iterations is reached.
- Multiple Imputation by Chained Equations (MICE): using a divide-and-conquer approach, MICE imputes missing values one variable at a time, using all the other variables in the data set (or a carefully selected subset of them) to estimate the missing values of that variable [31]. MICE first replaces the missing values of each variable with temporary values derived from that variable's non-missing values: for example, the missing oldpeak values are replaced with the mean observed oldpeak, the missing ca values with the mean observed ca, and so on. MICE then sets the temporary imputations of the oldpeak variable back to missing, so the current data copy has missing values for oldpeak but for no other variable. The algorithm regresses oldpeak on the other variables with a linear regression, dropping all records where oldpeak is missing when fitting the model; oldpeak is the dependent variable and the other features are the independent variables. MICE then predicts the missing oldpeak values with the fitted regression model. The same steps are repeated for every variable that has missing data.
Testing the methods of filling up the missing data
Classification phase
Training phase
Parameters | XGBoost | AdaBoost | Gradient Boost | Extra Trees | LightGBM | SGDC | Nu SVC |
---|---|---|---|---|---|---|---|
Learning rate | 0.1 | 1 | 1 |  | 0.009 | adaptive |  |
Number of estimators | 100 | 50 | 3 | 80 | 1000 |  |  |
Loss |  |  | deviance |  |  | log |  |
Objective |  |  |  |  | binary |  |  |
Number of training passes |  |  |  |  |  | 1000 |  |
Fraction of margin errors (nu) |  |  |  |  |  |  | 0.25 |
Kernel |  |  |  |  |  |  | RBF |
Testing phase
Results and performances of classification
- Specificity: the proportion of actual negative cases that are correctly identified as negative.$$\begin{aligned} \begin{array}{rcl} Specificity = \frac{TN}{TN+FP} \end{array} \end{aligned}$$(3)
- Precision: the proportion of correctly predicted positive observations among all predicted positive observations.$$\begin{aligned} \begin{array}{rcl} Precision = \frac{TP}{TP+FP} \end{array} \end{aligned}$$(4)
- Accuracy: the proportion of correctly predicted observations among all observations [76].$$\begin{aligned} \begin{array}{rcl} Accuracy = \frac{TP+TN}{TP+TN+FP+FN} \end{array} \end{aligned}$$(5)
- Recall: measures how well the model identifies true positives, i.e., the proportion of actual positive cases that are correctly predicted.$$\begin{aligned} \begin{array}{rcl} Recall = \frac{TP}{TP+FN} \end{array} \end{aligned}$$(6)
- F-measure: the harmonic mean of precision and recall, balancing the two metrics.$$\begin{aligned} \begin{array}{rcl} F{\text{-}}measure = \frac{2*Precision*Recall}{Precision+Recall} \end{array} \end{aligned}$$(7)
Results and discussion
Performance results
Metrics | XGBoost | AdaBoost | Gradient Boost | Extra Trees | LightGBM | SGDC | Nu SVC | Stacking algorithm |
---|---|---|---|---|---|---|---|---|
Accuracy | 92.37 | 91.66 | 90.27 | 94.44 | 94.44 | 92.36 | 93.75 | 95.83 |
Specificity | 92.24 | 93.24 | 90.54 | 94.59 | 94.59 | 91.89 | 93.24 | 94.59 |
Precision | 92 | 90.98 | 90.54 | 94.59 | 94.59 | 93.15 | 94.52 | 97 |
Recall | 93.24 | 93.24 | 90.54 | 94.59 | 94.59 | 91.89 | 93.24 | 95 |
F score | 92.61 | 92 | 90.54 | 94.59 | 94.59 | 92.51 | 93.87 | 96 |
Class | 0 | 1 |
---|---|---|
Before SMOTE | 3596 | 466 |
After SMOTE | 3596 | 3596 |
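SMOTE's core idea, interpolating between a minority sample and one of its nearest minority neighbors, can be sketched in NumPy as below. Real experiments would use an established implementation such as imbalanced-learn's `SMOTE`; this toy version, with random data in place of the Framingham features, only illustrates how the 466 minority samples are grown to match the 3596 majority samples.

```python
# Minimal SMOTE-style oversampling sketch (illustrative, not the
# reference implementation).
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic samples from minority-class matrix X_min."""
    rng = np.random.default_rng(rng)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        out.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(out)

minority = np.random.default_rng(0).normal(size=(466, 3))
synthetic = smote_like(minority, n_new=3596 - 466, rng=0)
print(len(minority) + len(synthetic))  # 3596, matching the balanced count
```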
Metrics | Accuracy | Specificity | Precision | Recall | F-measure |
---|---|---|---|---|---|
Values | 90.24 | 87.51 | 92 | 88 | 90 |
Performance evaluation (%) | Our approach UCI heart dataset | SVM Model Previous work [8] | BN Model Previous work [8] | LR Model Previous work [32] | KNN Model Previous work [32] | SVM Model Previous work [32] | DT Model Previous work [32] | RF Model Previous work [32] | SVM Model Previous work [33] | RF Model Previous work [33] |
---|---|---|---|---|---|---|---|---|---|---|
Accuracy | 95.83 | 57 | 52 | 90.16 | 91.80 | 90.16 | 86.89 | 85.25 | 88.9823 | 88.9812 |
Specificity | 94.59 | 87 | 86 |  |  |  |  |  |  |  |
Precision | 97 |  |  | 93.33 | 93.55 | 93.33 | 85.29 | 87.10 |  |  |
Recall | 95 |  |  | 87.50 | 90.62 | 87.50 | 90.62 | 84.38 |  |  |
F-measure | 96 |  |  | 90.32 | 92.06 | 90.32 | 87.88 | 85.71 |  |  |