Published in: Journal of Classification 1/2024

28-02-2024

Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data

Authors: Aboubacry Gaye, Abdou Ka Diongue, Seydou Nourou Sylla, Maryam Diarra, Amadou Diallo, Cheikh Talla, Cheikh Loucoubar

Abstract

This work addresses the problem of supervised classification for high-dimensional and highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling with an extension of principal component analysis (PCA) that incorporates class-conditional moment estimates in the low-dimensional projection. Block partitioning handles the high correlation in the data by grouping variables into blocks such that the correlation within a block is maximized and the correlation between variables in different blocks is minimized. The extended PCA then provides a supervised low-dimensional projection and clustering. Applied to gene expression data from 445 individuals, divided into two groups (diseased and non-diseased), with 719,656 single nucleotide polymorphisms (SNPs), the method shows good clustering and prediction performance.

SNPs are a type of genetic variation that represents a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the population origin of an individual and can act in isolation or jointly to affect a phenotype. Studying the contribution of genetics to infectious disease phenotypes is therefore crucial. The classical statistical models currently used in genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest for complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. We then use supervised approaches, in particular the approach that extends PCA by incorporating class-conditional moment estimates in the low-dimensional projection, to identify the SNPs that determine malaria episodes. Experimental results on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach achieves high accuracy in predicting malaria episodes.
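
The two-step idea in the abstract (group correlated variables into blocks, then project with class information) can be illustrated with a small sketch. This is not the authors' implementation. The block step below is a simple greedy threshold on the squared correlation of adjacent columns, swapped in for the interval-graph method, and the projection is one possible way to use class-conditional means: the difference of class means is placed ahead of the principal directions of the class-centered data. The function names, the threshold of 0.5, the block-mean summary, and the simulated continuous data (real genotypes would be coded 0/1/2) are all assumptions made for illustration.

```python
import numpy as np


def contiguous_blocks(X, r2_threshold=0.5):
    """Greedy stand-in for LD block partitioning (NOT the interval-graph method).

    A new block starts whenever the squared Pearson correlation between
    two adjacent columns falls below r2_threshold.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    sd = Xc.std(axis=0)
    sd[sd == 0] = 1.0                      # guard against constant columns
    Z = Xc / sd
    # squared correlation between column j and column j + 1
    r2 = (np.sum(Z[:, :-1] * Z[:, 1:], axis=0) / n) ** 2
    blocks, start = [], 0
    for j in range(p - 1):
        if r2[j] < r2_threshold:
            blocks.append(np.arange(start, j + 1))
            start = j + 1
    blocks.append(np.arange(start, p))
    return blocks


def supervised_projection(X, y, n_components=2):
    """Two-class projection that uses class-conditional means.

    Columns of the returned matrix span: the difference of class means,
    followed by the leading principal directions of the data after each
    sample has been centered by its own class mean.
    """
    classes = np.unique(y)
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    mu0 = X[y == classes[0]].mean(axis=0)
    mu1 = X[y == classes[1]].mean(axis=0)
    delta = mu1 - mu0
    # subtract the class-conditional mean from every sample
    Xc = X - np.where((y == classes[1])[:, None], mu1, mu0)
    # rows of Vt are principal directions of the class-centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = np.column_stack([delta, Vt[: n_components - 1].T])
    Q, _ = np.linalg.qr(A)                 # orthonormalize the directions
    return Q                               # shape (p, n_components)


# Simulated example: 3 blocks of 4 highly correlated columns each.
rng = np.random.default_rng(0)
n, block_size, n_blocks = 200, 4, 3
y = np.repeat([0, 1], n // 2)
latent = rng.normal(size=(n, n_blocks))
latent[:, 0] += 2.0 * y                    # only the first block carries class signal
noise = 0.1 * rng.normal(size=(n, n_blocks * block_size))
X = np.repeat(latent, block_size, axis=1) + noise

blocks = contiguous_blocks(X)
# summarize each block by the mean of its columns (an illustrative choice)
X_blocks = np.column_stack([X[:, b].mean(axis=1) for b in blocks])
Q = supervised_projection(X_blocks, y, n_components=2)
scores = X_blocks @ Q                      # low-dimensional coordinates
```

The projected coordinates in `scores` could then be passed to a classifier such as LDA. Summarizing each block by its mean is only one option; selecting a representative variable per block, or projecting within each block, would fit the same outline.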

Metadata
Title
Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data
Authors
Aboubacry Gaye
Abdou Ka Diongue
Seydou Nourou Sylla
Maryam Diarra
Amadou Diallo
Cheikh Talla
Cheikh Loucoubar
Publication date
28-02-2024
Publisher
Springer US
Published in
Journal of Classification / Issue 1/2024
Print ISSN: 0176-4268
Electronic ISSN: 1432-1343
DOI
https://doi.org/10.1007/s00357-024-09463-5
