Published in: Journal of Classification 1/2024

28.02.2024

Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data

Authors: Aboubacry Gaye, Abdou Ka Diongue, Seydou Nourou Sylla, Maryam Diarra, Amadou Diallo, Cheikh Talla, Cheikh Loucoubar

Abstract

This work addresses supervised classification of high-dimensional, highly correlated data using correlation blocks and supervised dimension reduction. We propose a method that combines block partitioning based on interval graph modeling with an extension of principal component analysis (PCA) that incorporates conditional class moment estimates in the low-dimensional projection. Block partitioning handles the high correlation in the data by grouping variables into blocks so that the correlation within a block is maximized and the correlation between variables in different blocks is minimized. The extended PCA then performs a supervised low-dimensional projection and clustering. Applied to gene expression data from 445 individuals divided into two groups (diseased and non-diseased) and 719,656 single nucleotide polymorphisms (SNPs), the method shows good clustering and prediction performance.

SNPs are a type of genetic variation that represents a difference in a single deoxyribonucleic acid (DNA) building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the population of origin of an individual and can act in isolation or jointly to affect a phenotype. The study of the contribution of genetics to infectious disease phenotypes is therefore crucial. The classical statistical models currently used in genome-wide association studies (GWAS) have shown their limitations in detecting genes of interest for complex diseases such as asthma or malaria. In this study, we first investigate a linkage disequilibrium (LD) block partition method based on interval graph modeling to handle the high correlation between SNPs. We then use supervised approaches, in particular the approach that extends PCA by incorporating conditional class moment estimates in the low-dimensional projection, to identify the SNPs that determine malaria episodes. Experimental results obtained on the Dielmo-Ndiop project dataset show that the linear discriminant analysis (LDA) approach achieves high accuracy in predicting malaria episodes.
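
The pipeline outlined in the abstract (group correlated variables into blocks, project with class information, classify with LDA) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `greedy_blocks` is a simple adjacent-correlation rule standing in for the interval graph method, `supervised_projection` augments PCA with the difference of class means as one possible reading of "conditional class moment estimates", the LDA is a two-class version with equal priors, and the data are synthetic rather than the Dielmo-Ndiop genotypes.

```python
import numpy as np

def greedy_blocks(X, threshold=0.5):
    """Hypothetical stand-in for LD block partitioning: start a new block
    whenever |corr| between adjacent columns falls below `threshold`."""
    corr = np.corrcoef(X, rowvar=False)
    blocks, current = [], [0]
    for j in range(1, X.shape[1]):
        if abs(corr[j - 1, j]) >= threshold:
            current.append(j)
        else:
            blocks.append(current)
            current = [j]
    blocks.append(current)
    return blocks

def supervised_projection(X, y, n_components=2):
    """Two-class projection: first direction is the difference of class means,
    the rest are top principal axes of the class-centered data."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    delta = means[1] - means[0]
    centered = X - means[np.searchsorted(classes, y)]
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = np.column_stack([delta, vt[: n_components - 1].T])
    q, _ = np.linalg.qr(basis)  # orthonormalize the columns
    return q                    # shape (n_features, n_components)

def fit_lda(Z, y):
    """Two-class LDA with pooled covariance and equal priors (an assumption)."""
    classes = np.unique(y)
    means = np.stack([Z[y == c].mean(axis=0) for c in classes])
    centered = Z - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(Z) - len(classes))
    w = np.linalg.solve(cov, means[1] - means[0])
    b = -0.5 * w @ (means[0] + means[1])
    return classes, w, b

def predict_lda(model, Z):
    classes, w, b = model
    return classes[(Z @ w + b > 0).astype(int)]

# Synthetic data: 6 blocks of 10 highly correlated columns each.
rng = np.random.default_rng(0)
n, p = 200, 60
y = rng.integers(0, 2, size=n)
latent = rng.normal(size=(n, 6))
X = np.repeat(latent, 10, axis=1) + 0.3 * rng.normal(size=(n, p))
X[:, :10] += 3.0 * y[:, None]  # only the first block carries class signal

blocks = greedy_blocks(X, threshold=0.5)
# Summarize each block by its column mean (one feature per block).
reps = np.column_stack([X[:, b].mean(axis=1) for b in blocks])

train = np.arange(n) < 150
test = ~train
A = supervised_projection(reps[train], y[train], n_components=2)
model = fit_lda(reps[train] @ A, y[train])
accuracy = (predict_lda(model, reps[test] @ A) == y[test]).mean()
print(len(blocks), A.shape, accuracy)
```

Summarizing each block by its mean is only one way to use the partition; selecting a representative variable per block or projecting within each block are equally plausible choices that the abstract does not rule out.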


Metadata
Title
Supervised Classification of High-Dimensional Correlated Data: Application to Genomic Data
Authors
Aboubacry Gaye
Abdou Ka Diongue
Seydou Nourou Sylla
Maryam Diarra
Amadou Diallo
Cheikh Talla
Cheikh Loucoubar
Publication date
28.02.2024
Publisher
Springer US
Published in
Journal of Classification, Issue 1/2024
Print ISSN: 0176-4268
Electronic ISSN: 1432-1343
DOI
https://doi.org/10.1007/s00357-024-09463-5
