Skip to main content

2022 | OriginalPaper | Buchkapitel

A Fast, Provably Accurate Approximation Algorithm for Sparse Principal Component Analysis Reveals Human Genetic Variation Across the World

verfasst von : Agniva Chowdhury, Aritra Bose, Samson Zhou, David P. Woodruff, Petros Drineas

Erschienen in: Research in Computational Molecular Biology

Verlag: Springer International Publishing

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Principal component analysis (PCA) is a widely used dimensionality reduction technique in machine learning and multivariate statistics. To improve the interpretability of PCA, various approaches to obtain sparse principal direction loadings have been proposed, which are termed Sparse Principal Component Analysis (SPCA). In this paper, we present ThreSPCA, a provably accurate algorithm based on thresholding the Singular Value Decomposition for the SPCA problem, without imposing any restrictive assumptions on the input covariance matrix. Our thresholding algorithm is conceptually simple; much faster than current state-of-the-art; and performs well in practice. When applied to genotype data from the 1000 Genomes Project, ThreSPCA is faster than previous benchmarks, at least as accurate, and leads to a set of interpretable biomarkers, revealing genetic diversity across the world.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Fußnoten
1
Recall that the p-th power of the \(\ell _p\) norm of a vector \({\mathbf {x}} \in \mathbb {R}^n\) is defined as \(\Vert \mathbf {x}\Vert _p^p = \sum _{i=1}^n {\left| {\mathbf {x}} _i\right| } ^p\) for \(0<p<\infty \). For \(p=0\), \(\Vert x\Vert _0\) is a semi-norm denoting the number of non-zero entries of x.
 
2
Each column of \(\mathbf {R}\) has a single non-zero entry (set to one), corresponding to one of the |R| selected columns. Formally, \(\mathbf {R}_{i_t,t}=1\) for \(t=1,\ldots , |R|\); all other entries of \(\mathbf {R}\) are set to zero.
 
3
Results from L-PCA are qualitatively very similar to AL-PCA and we only report results for the latter.
 
Literatur
1.
Zurück zum Zitat Asteris, M., Papailiopoulos, D., Karystinos, G.N.: Sparse principal component of a rank-deficient matrix. In: 2011 IEEE International Symposium on Information Theory Proceedings, pp. 673–677 (2011) Asteris, M., Papailiopoulos, D., Karystinos, G.N.: Sparse principal component of a rank-deficient matrix. In: 2011 IEEE International Symposium on Information Theory Proceedings, pp. 673–677 (2011)
2.
Zurück zum Zitat Asteris, M., Papailiopoulos, D., Kyrillidis, A., Dimakis, A.G.: Sparse PCA via bipartite matchings. In: Advances in Neural Information Processing Systems, pp. 766–774 (2015) Asteris, M., Papailiopoulos, D., Kyrillidis, A., Dimakis, A.G.: Sparse PCA via bipartite matchings. In: Advances in Neural Information Processing Systems, pp. 766–774 (2015)
4.
Zurück zum Zitat Bose, A., Burch, M.C., Chowdhury, A., Paschou, P., Drineas, P.: Clustrat: a structure informed clustering strategy for population stratification. bioRxiv (2020) Bose, A., Burch, M.C., Chowdhury, A., Paschou, P., Drineas, P.: Clustrat: a structure informed clustering strategy for population stratification. bioRxiv (2020)
5.
Zurück zum Zitat Bose, A., Kalantzis, V., Kontopoulou, E.M., Elkady, M., Paschou, P., Drineas, P.: TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics 35(19), 3679–3683 (2019)CrossRef Bose, A., Kalantzis, V., Kontopoulou, E.M., Elkady, M., Paschou, P., Drineas, P.: TeraPCA: a fast and scalable software package to study genetic variation in tera-scale genotypes. Bioinformatics 35(19), 3679–3683 (2019)CrossRef
6.
Zurück zum Zitat Buniello, A., et al.: The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47(D1), D1005–D1012 (2019)CrossRef Buniello, A., et al.: The NHGRI-EBI GWAS catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 47(D1), D1005–D1012 (2019)CrossRef
7.
Zurück zum Zitat Cadima, J., Jolliffe, I.T.: Loading and correlations in the interpretation of principal components. J. Appl. Stat. 22(2), 203–214 (1995)MathSciNetCrossRef Cadima, J., Jolliffe, I.T.: Loading and correlations in the interpretation of principal components. J. Appl. Stat. 22(2), 203–214 (1995)MathSciNetCrossRef
8.
Zurück zum Zitat Chan, S.O., Papailliopoulos, D., Rubinstein, A.: On the approximability of sparse PCA. In: Proceedings of the 29th Conference on Learning Theory, pp. 623–646 (2016) Chan, S.O., Papailliopoulos, D., Rubinstein, A.: On the approximability of sparse PCA. In: Proceedings of the 29th Conference on Learning Theory, pp. 623–646 (2016)
9.
Zurück zum Zitat Chang, C.C., Chow, C.C., Tellier, L.C., Vattikuti, S., Purcell, S.M., Lee, J.J.: Second-generation plink: rising to the challenge of larger and richer datasets. Gigascience 4(1), s13742-015 (2015)CrossRef Chang, C.C., Chow, C.C., Tellier, L.C., Vattikuti, S., Purcell, S.M., Lee, J.J.: Second-generation plink: rising to the challenge of larger and richer datasets. Gigascience 4(1), s13742-015 (2015)CrossRef
10.
Zurück zum Zitat Consortium, G.P., et al.: A global reference for human genetic variation. Nature 526(7571), 68 (2015) Consortium, G.P., et al.: A global reference for human genetic variation. Nature 526(7571), 68 (2015)
11.
Zurück zum Zitat d’Aspremont, A., Ghaoui, L.E., Jordan, M.I., Lanckriet, G.R.G.: A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49(3), 434–448 (2007)MathSciNetCrossRef d’Aspremont, A., Ghaoui, L.E., Jordan, M.I., Lanckriet, G.R.G.: A direct formulation for sparse PCA using semidefinite programming. SIAM Rev. 49(3), 434–448 (2007)MathSciNetCrossRef
12.
Zurück zum Zitat Engelhardt, B.E., Stephens, M.: Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS Genet. 6(9), e1001117 (2010)CrossRef Engelhardt, B.E., Stephens, M.: Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS Genet. 6(9), e1001117 (2010)CrossRef
13.
Zurück zum Zitat Hsu, Y.L., Huang, P.Y., Chen, D.T.: Sparse principal component analysis in cancer research. Transl. Cancer Res. 3(3), 182 (2014) Hsu, Y.L., Huang, P.Y., Chen, D.T.: Sparse principal component analysis in cancer research. Transl. Cancer Res. 3(3), 182 (2014)
14.
Zurück zum Zitat Jolliffe, I.T.: Rotation of principal components: choice of normalization constraints. J. Appl. Stat. 22(1), 29–35 (1995)MathSciNetCrossRef Jolliffe, I.T.: Rotation of principal components: choice of normalization constraints. J. Appl. Stat. 22(1), 29–35 (1995)MathSciNetCrossRef
15.
Zurück zum Zitat Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based on the LASSO. J. Comput. Graph. Stat. 12(3), 531–547 (2003)MathSciNetCrossRef Jolliffe, I.T., Trendafilov, N.T., Uddin, M.: A modified principal component technique based on the LASSO. J. Comput. Graph. Stat. 12(3), 531–547 (2003)MathSciNetCrossRef
16.
Zurück zum Zitat Lee, S., et al.: Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet. Epidemiol. 36(4), 293–302 (2012)CrossRef Lee, S., et al.: Sparse principal component analysis for identifying ancestry-informative markers in genome-wide association studies. Genet. Epidemiol. 36(4), 293–302 (2012)CrossRef
17.
Zurück zum Zitat Li, J.Z., et al.: Worldwide human relationships inferred from genome-wide patterns of variation. Science 319(5866), 1100–1104 (2008)CrossRef Li, J.Z., et al.: Worldwide human relationships inferred from genome-wide patterns of variation. Science 319(5866), 1100–1104 (2008)CrossRef
18.
Zurück zum Zitat Mahoney, M.W., Drineas, P.: CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. 106(3), 697–702 (2009)MathSciNetCrossRef Mahoney, M.W., Drineas, P.: CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. 106(3), 697–702 (2009)MathSciNetCrossRef
19.
Zurück zum Zitat McLaren, W., et al.: The ensembl variant effect predictor. Genome Biol. 17(1), 1–14 (2016)CrossRef McLaren, W., et al.: The ensembl variant effect predictor. Genome Biol. 17(1), 1–14 (2016)CrossRef
20.
Zurück zum Zitat Moghaddam, B., Weiss, Y., Avidan, S.: Generalized spectral bounds for sparse LDA. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 641–648 (2006) Moghaddam, B., Weiss, Y., Avidan, S.: Generalized spectral bounds for sparse LDA. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 641–648 (2006)
21.
Zurück zum Zitat Musco, C., Musco, C.: Randomized block krylov methods for stronger and faster approximate singular value decomposition. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, pp. 1396–1404 (2015) Musco, C., Musco, C.: Randomized block krylov methods for stronger and faster approximate singular value decomposition. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, pp. 1396–1404 (2015)
22.
Zurück zum Zitat Papailiopoulos, D., Dimakis, A., Korokythakis, S.: Sparse PCA through low-rank approximations. In: Proceedings of the 30th International Conference on Machine Learning, pp. 747–755 (2013) Papailiopoulos, D., Dimakis, A., Korokythakis, S.: Sparse PCA through low-rank approximations. In: Proceedings of the 30th International Conference on Machine Learning, pp. 747–755 (2013)
23.
Zurück zum Zitat Patterson, N., Price, A.L., Reich, D.: Population structure and eigenanalysis. PLoS Genet. 2(12), e190 (2006) Patterson, N., Price, A.L., Reich, D.: Population structure and eigenanalysis. PLoS Genet. 2(12), e190 (2006)
24.
Zurück zum Zitat Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., Reich, D.: Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38(8), 904–909 (2006)CrossRef Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., Reich, D.: Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38(8), 904–909 (2006)CrossRef
25.
Zurück zum Zitat Pritchard, J.K., Stephens, M., Donnelly, P.: Inference of population structure using multilocus genotype data. Genetics 155(2), 945–959 (2000)CrossRef Pritchard, J.K., Stephens, M., Donnelly, P.: Inference of population structure using multilocus genotype data. Genetics 155(2), 945–959 (2000)CrossRef
26.
Zurück zum Zitat Sohail, M., et al.: Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. Elife 8, e39702 (2019)CrossRef Sohail, M., et al.: Polygenic adaptation on height is overestimated due to uncorrected stratification in genome-wide association studies. Elife 8, e39702 (2019)CrossRef
27.
Zurück zum Zitat Yu, G., Wang, L.G., Han, Y., He, Q.Y.: clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS J. Integr. Biol. 16(5), 284–287 (2012) Yu, G., Wang, L.G., Han, Y., He, Q.Y.: clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS J. Integr. Biol. 16(5), 284–287 (2012)
28.
Zurück zum Zitat Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B 67(2), 301–320 (2005)MathSciNetCrossRef Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. Roy. Stat. Soc. B 67(2), 301–320 (2005)MathSciNetCrossRef
Metadaten
Titel
A Fast, Provably Accurate Approximation Algorithm for Sparse Principal Component Analysis Reveals Human Genetic Variation Across the World
verfasst von
Agniva Chowdhury
Aritra Bose
Samson Zhou
David P. Woodruff
Petros Drineas
Copyright-Jahr
2022
DOI
https://doi.org/10.1007/978-3-031-04749-7_6