2024 | OriginalPaper | Book chapter

Analysis of Synthetic Data Generation Techniques in Diabetes Prediction

Written by: Sujit Kumar Das, Pinki Roy, Arnab Kumar Mishra

Published in: Big Data, Machine Learning, and Applications

Publisher: Springer Nature Singapore

Abstract

Inadequate and class-imbalanced data is one of the major problems in classification tasks. Applying synthetic data generation (SDG) approaches to handle class imbalance can therefore help improve the performance of Machine Learning (ML) classifiers. The aim of this work is to explore various SDG approaches for improving diabetes prediction on the Pima Indian Diabetes Dataset (PIDD). We also propose a hybrid SDG approach, named SSVMSMOTE, that combines two widely used SDG techniques: the Synthetic Minority Oversampling TEchnique (SMOTE) and the Support Vector Machine-Synthetic Minority Oversampling TEchnique (SVM-SMOTE). The idea is to divide the training data into two equal halves and apply SMOTE and SVM-SMOTE separately to the resulting sub-training samples. The approach successfully overcomes the limitations of SMOTE and SVM-SMOTE. A set of classifiers, namely Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Logistic Regression (LR), Gaussian Naive Bayes (GNB), AdaBoost (AB), Extreme Gradient Boosting (XGB), Gradient Boosting (GB), and Light Gradient Boosting (LGM), is trained on the combined resampled training data and tested on a hold-out test set. The experiments show that the boosting classifiers XGB and GB outperform the other classifiers considered. Further, the XGB classifier, combined with the proposed SDG technique, achieved the highest average accuracy of 0.9415. The proposed approach also achieved promising results on other important evaluation metrics such as F-score, AUC, sensitivity, and specificity. These results suggest that the proposed approach is applicable to real-life decision-making processes.
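
The split-and-combine idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`smote`, `balance`, `hybrid_resample`), the stratified split, the one-oversampler-per-half assignment, and the toy data are all assumptions made for illustration. Only plain SMOTE interpolation is implemented, using the standard library; the second oversampler is a pluggable callable where an SVM-SMOTE implementation (for example, a wrapper around imbalanced-learn's `SVMSMOTE`) would go, and the demo passes plain SMOTE there as a stand-in.

```python
import math
import random


def smote(minority, n_new, k=5, rng=None):
    """Plain SMOTE: interpolate between a minority point and one of its
    k nearest minority neighbours. Needs at least two minority points."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance(X, y, oversample, rng):
    """Oversample the smaller class of (X, y) until both classes match in size."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    minority, label = (pos, 1) if len(pos) < len(neg) else (neg, 0)
    n_new = abs(len(pos) - len(neg))
    return X + oversample(minority, n_new, rng=rng), y + [label] * n_new


def hybrid_resample(X, y, sampler_a, sampler_b, seed=0):
    """Split the training data into two halves, oversample each half with a
    different sampler, and concatenate the results.

    Assumption: the split is stratified so that each half contains both classes.
    """
    rng = random.Random(seed)
    halves = ([], []), ([], [])
    for label in (0, 1):
        idx = [i for i, l in enumerate(y) if l == label]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            X_half, y_half = halves[pos % 2]  # alternate between the halves
            X_half.append(X[i])
            y_half.append(y[i])
    X_a, y_a = balance(*halves[0], sampler_a, rng)
    X_b, y_b = balance(*halves[1], sampler_b, rng)
    return X_a + X_b, y_a + y_b


# Toy imbalanced data (hypothetical, not PIDD): 20 majority and 8 minority points.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(20)]
X += [[data_rng.gauss(3, 1), data_rng.gauss(3, 1)] for _ in range(8)]
y = [0] * 20 + [1] * 8

# Stand-in: plain SMOTE is used for both halves; replace the second argument
# with an SVM-SMOTE callable of the same signature to mirror the described hybrid.
X_res, y_res = hybrid_resample(X, y, smote, smote)
print(len(X_res), y_res.count(0), y_res.count(1))  # prints "40 20 20"
```

Each half is balanced independently before the halves are merged, so the combined training set is balanced as well. The classifiers listed in the abstract would then be trained on `X_res`, `y_res` and evaluated on a separate hold-out test set that is never resampled.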


Metadata
Title
Analysis of Synthetic Data Generation Techniques in Diabetes Prediction
Written by
Sujit Kumar Das
Pinki Roy
Arnab Kumar Mishra
Copyright year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-99-3481-2_45