13-08-2024

Comparative Analysis of Machine Learning Techniques for Imbalanced Genetic Data

Authors: Arshmeet Kaur, Morteza Sarmadi

Published in: Annals of Data Science

Abstract

Advances in genome sequencing technologies have significantly increased the availability of genomic data. Machine learning models that predict the pathogenicity or clinical significance of genetic mutations are crucial. However, genetic datasets often feature imbalanced target variables and high-cardinality, skewed predictor variables, which complicate machine learning modeling. This study addresses these challenges in both regression and classification tasks. We systematically explored the impact of various data preprocessing techniques, feature selection methods, and model choices on the performance of machine learning models trained on imbalanced genetic data, and evaluated performance using fivefold cross-validation. Our key findings show that the regression models are robust to outliers and skew in predictor and target variables. Similarly, in classification tasks, class-imbalanced target variables and skewed predictors have minimal impact on model performance. Among the models tested, random forest was the most effective for both imbalanced regression and classification tasks. Our key contributions are as follows. First, we address a significant research gap by focusing on imbalanced regression, a problem that is sparsely explored compared to class-imbalanced classification. Second, we identify the techniques that improve prediction performance and provide practical insights into handling genetic data. Third, we provide a foundation for future research to further optimize machine learning approaches in genomics. Although this study uses a genetic dataset as a case study, our findings are applicable to imbalanced data in other fields.
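To make the evaluation setup more concrete, the following is a minimal sketch of fivefold cross-validation on a class-imbalanced target. It is not the authors' code. The toy labels, the stratified splitting, the balanced-accuracy metric, and the majority-class baseline (standing in for a real model such as a random forest) are all assumptions chosen for illustration; the abstract states only that fivefold cross-validation was used.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose folds roughly keep the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class's samples round-robin so every fold gets its share.
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def balanced_accuracy(y_true, y_pred):
    """Mean per-class recall; not dominated by the majority class."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(1 for y in y_true if y == c)
        hits = sum(1 for y, p in zip(y_true, y_pred) if y == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels: 90 "benign" (0) and 10 "pathogenic" (1).
labels = [0] * 90 + [1] * 10

scores = []
for train_idx, test_idx in stratified_kfold(labels, k=5):
    # Majority-class baseline; a real study would fit a model on the training fold.
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    y_true = [labels[i] for i in test_idx]
    y_pred = [majority] * len(y_true)
    scores.append(balanced_accuracy(y_true, y_pred))

mean_score = sum(scores) / len(scores)
```

On this toy data the baseline would reach 90% plain accuracy in every fold while its balanced accuracy stays at 0.5, which is why a metric that weights classes equally is often preferred when judging how much class imbalance affects a model.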

Metadata
Title
Comparative Analysis of Machine Learning Techniques for Imbalanced Genetic Data
Authors
Arshmeet Kaur
Morteza Sarmadi
Publication date
13-08-2024
Publisher
Springer Berlin Heidelberg
Published in
Annals of Data Science
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI
https://doi.org/10.1007/s40745-024-00575-8