13.08.2024

Comparative Analysis of Machine Learning Techniques for Imbalanced Genetic Data

Written by: Arshmeet Kaur, Morteza Sarmadi

Published in: Annals of Data Science

Abstract

Advances in genome sequencing technologies have significantly increased the availability of genomic data. The use of machine learning models to predict the pathogenicity or clinical significance of genetic mutations is crucial. However, genetic datasets often feature imbalanced target variables and high-cardinality, skewed predictor variables, which complicate the modeling process. This study addresses these challenges in both regression and classification tasks. We systematically explored how data preprocessing techniques, feature selection methods, and model choices affect the performance of machine learning models trained on imbalanced genetic data, and we evaluated performance using fivefold cross-validation. Our key findings show that the regression models are robust to outliers and skew in predictor and target variables. Similarly, in classification tasks, class-imbalanced target variables and skewed predictors have minimal impact on model performance. Among the models tested, random forest was the most effective for both imbalanced regression and classification tasks. Our key contributions are as follows. First, we address a significant research gap by focusing on imbalanced regression, a problem that is sparsely explored compared with class-imbalanced classification. Second, we identify the techniques that improve prediction performance and provide practical insights into handling genetic data. Third, we provide a foundation for future research to further optimize machine learning approaches in genomics. Although this study uses a genetic dataset as a case study, our findings are applicable to imbalanced data in other fields.
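
As a rough illustration of the fivefold cross-validation mentioned in the abstract, the following sketch builds five train/test splits for a class-imbalanced target. It is not the authors' code. The stratification step (keeping the class ratio similar in every fold), the function name `stratified_kfold`, and the toy 90/10 label vector are all assumptions made for illustration; the abstract does not say whether the folds were stratified.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class proportions per fold.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    # Shuffle within each class, then deal the indices round-robin
    # across the k folds so every fold receives its share of each class.
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)

    # Each fold serves once as the test set; the rest is the training set.
    all_idx = set(range(len(labels)))
    for test in folds:
        yield sorted(all_idx - set(test)), sorted(test)


# Assumed toy target: 90 majority-class (0) and 10 minority-class (1) samples.
labels = [0] * 90 + [1] * 10
splits = list(stratified_kfold(labels, k=5))

# Count minority-class samples in each test fold.
minority_per_fold = [sum(labels[i] for i in test) for _, test in splits]
print(minority_per_fold)
```

With an unstratified random split, a fold could by chance contain very few or no minority-class samples, which would make per-fold metrics unstable. Dealing each class across the folds separately avoids that, at the cost of assuming the labels are discrete, so a continuous regression target would first need to be binned.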

Metadata
Title
Comparative Analysis of Machine Learning Techniques for Imbalanced Genetic Data
Authors
Arshmeet Kaur
Morteza Sarmadi
Publication date
13.08.2024
Publisher
Springer Berlin Heidelberg
Published in
Annals of Data Science
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI
https://doi.org/10.1007/s40745-024-00575-8