ABSTRACT
Principal Components Analysis (PCA) is the predominant linear dimensionality reduction technique, and it has been widely applied to datasets across all scientific domains. We consider, both theoretically and empirically, the topic of unsupervised feature selection for PCA, by leveraging algorithms for the so-called Column Subset Selection Problem (CSSP). In words, the CSSP seeks the "best" subset of exactly k columns from an m x n data matrix A, and it has been extensively studied in the Numerical Linear Algebra community. We present a novel two-stage algorithm for the CSSP. From a theoretical perspective, for small to moderate values of k, this algorithm significantly improves upon the best previously-existing results [24, 12] for the CSSP. From an empirical perspective, we evaluate this algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis: finance, document-term data, and genetics. We pay particular attention to how this algorithm may be used to select representative or landmark features from an object-feature matrix in an unsupervised manner. In all three application domains, we are able to identify k landmark features, i.e., columns of the data matrix, that capture nearly the same amount of information as does the subspace spanned by the top k "eigenfeatures."
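As a concrete illustration of the two-stage flavor of algorithm the abstract describes (this is a hedged sketch, not the paper's exact method or guarantees), one common heuristic first oversamples candidate columns randomly, with probabilities derived from the leverage scores of the top-k right singular vectors, and then selects exactly k of the candidates deterministically via a pivoted QR factorization. The function name `select_columns` and the oversampling size `c` below are illustrative choices:

```python
import numpy as np
from scipy.linalg import qr

def select_columns(A, k, c=None, rng=None):
    """Two-stage CSSP-style heuristic (illustrative sketch).

    Stage 1: randomly sample c candidate columns, with probabilities
             proportional to leverage scores from the top-k right
             singular vectors of A.
    Stage 2: deterministically keep exactly k of the candidates
             using a column-pivoted QR factorization.
    """
    rng = np.random.default_rng(rng)
    m, n = A.shape
    if c is None:
        c = min(n, 4 * k)  # oversampling size: an assumed, tunable choice

    # Stage 1: leverage-score sampling.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(Vt[:k, :] ** 2, axis=0)   # leverage scores; they sum to k
    probs = lev / lev.sum()
    cand = rng.choice(n, size=c, replace=False, p=probs)

    # Stage 2: pivoted QR on the sampled columns picks the final k.
    _, _, piv = qr(A[:, cand], mode='economic', pivoting=True)
    return np.sort(cand[piv[:k]])

# Usage: pick 5 landmark columns from a random 50 x 30 matrix.
A = np.random.default_rng(0).standard_normal((50, 30))
cols = select_columns(A, k=5)
```

The returned indices play the role of the "landmark features": projecting onto the span of these k columns is then compared against the span of the top k eigenfeatures.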
- A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A Direct Formulation for Sparse PCA Using Semidefinite Programming. SIAM Review, 49(3), July 2007.
- A. Ben-Hur and I. Guyon. Detecting stable clusters using principal component analysis. Methods Mol Biol, 224:159--182, 2003.
- M.W. Berry, S.A. Pulatova, and G.W. Stewart. Computing sparse reduced-rank approximations to sparse matrices. ACM Trans on Math Soft, 31:252--269, 2005.
- C.H. Bischof and G. Quintana-Ortí. Algorithm 782: codes for rank-revealing QR factorizations of dense matrices. ACM Trans on Math Soft, 24:254--257, 1998.
- C.H. Bischof and G. Quintana-Ortí. Computing rank-revealing QR factorizations of dense matrices. ACM Trans on Math Soft, 24(2):226--253, 1998.
- C. Boutsidis, M.W. Mahoney, and P. Drineas. Manuscript in preparation, 2008.
- J. Cadima, J.O. Cerdeira, and M. Minhoto. Computational aspects of algorithms for variable selection in the context of principal components. Computational Statistics & Data Analysis, 47(2):225--236, 2004.
- T.F. Chan. Rank revealing QR factorizations. Linear Algebra Appl, 88/89:67--82, 1987.
- T.F. Chan and P.C. Hansen. Some applications of the rank revealing QR factorization. SIAM J Sci and Stat Comp, 13:727--741, 1992.
- T.F. Chan and P.C. Hansen. Low-rank revealing QR factorizations. Linear Algebra Appl, 1:33--44, 1994.
- S. Chandrasekaran and I.C.F. Ipsen. On rank-revealing factorizations. SIAM J Matrix Anal Appl, 15:592--622, 1994.
- A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix approximation and projective clustering via volume sampling. In SODA, 2006.
- M. Devaney and A. Ram. Efficient feature selection in conceptual clustering. In ICML, 1997.
- P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Subspace sampling and relative-error matrix approximation: column-based methods. In APPROX-RANDOM, 2006.
- P. Drineas, M.W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. http://arxiv.org/abs/0708.3696, 2007.
- J.G. Dy and C.E. Brodley. Feature selection for unsupervised learning. J. Mach. Learn. Res., 5:845--889, 2004.
- R.D. Fierro, P.C. Hansen, and P. Hansen. UTV Tools: Matlab templates for rank-revealing UTV decompositions. Numerical Algorithms, 20(2-3):165--194, 1999.
- L.V. Foster. Rank and null space calculations using matrix decomposition without column interchanges. Linear Algebra Appl, 74:47--71, 1986.
- L.V. Foster and X. Liu. Comparison of rank revealing algorithms applied to matrices with well-defined numerical ranks. Manuscript, 2006.
- A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In FOCS, 1998.
- E. Gabrilovich and S. Markovitch. Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5. In ICML, 2004.
- G.H. Golub. Numerical methods for solving linear least squares problems. Numer Math, 7:206--216, 1965.
- G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.
- M. Gu and S.C. Eisenstat. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J Sci Comp, 17:848--869, 1996.
- Y.P. Hong and C.T. Pan. Rank-revealing QR factorizations and the singular value decomposition. Math Comp, 58:213--232, 1992.
- W.J. Krzanowski. Selection of variables to preserve multivariate data structure, using principal components. Applied Statistics, 36(1):22--33, 1987.
- F.G. Kuruvilla, P.J. Park, and S.L. Schreiber. Vector algebra in the analysis of genome-wide expression data. Genome Biology, 3, 2002.
- K.Z. Mao. Identifying critical variables of principal components for unsupervised feature selection. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(2):339--344, 2005.
- M.W. Mahoney, M. Maggioni, and P. Drineas. Tensor-CUR decompositions for tensor-based data. In KDD, 2006.
- P. Menozzi, A. Piazza, and L. Cavalli-Sforza. Synthetic maps of human gene frequencies in Europeans. Science, 201(4358):786--792, 1978.
- P. Mitra, C.A. Murthy, and S.K. Pal. Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Mach. Intell., 24(3):301--312, 2002.
- Open Directory Project. http://www.dmoz.org/
- C.T. Pan. On the existence and computation of rank-revealing LU factorizations. Linear Algebra Appl, 316:199--222, 2000.
- C.T. Pan and P.T.P. Tang. Bounds on singular values revealed by QR factorizations. BIT Numerical Mathematics, 39:740--756, 1999.
- P. Paschou, E. Ziv, E.G. Burchard, S. Choudhry, W.R. Cintron, M.W. Mahoney, and P. Drineas. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genetics, 3(9), 2007.
- P. Paschou, M.W. Mahoney, A. Javed, J.R. Kidd, A.J. Pakstis, S. Gu, K.K. Kidd, and P. Drineas. Intra- and inter-population genotype reconstruction from tagging SNPs. Genome Research, 17:96--107, 2007.
- N.A. Rosenberg, L.M. Li, R. Ward, and J.K. Pritchard. Informativeness of genetic markers for inference of ancestry. Am J Hum Genet, 73(6):1402--1422, 2003.
- G.W. Stewart. Four algorithms for the efficient computation of truncated QR approximations to a sparse matrix. Num Math, 83:313--323, 1999.
- H. Stoppiglia, G. Dreyfus, R. Dubois, and Y. Oussar. Ranking a random feature for variable and feature selection. J. Mach. Learn. Res., 3:1399--1414, 2003.
- J. Sun, Y. Xie, H. Zhang, and C. Faloutsos. Less is more: compact matrix decomposition for large sparse graphs. In SDM, 2007.
- The International HapMap Consortium. A haplotype map of the human genome. Nature, 437:1299--1320, 2005.
- L. Wolf and A. Shashua. Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach. J. Mach. Learn. Res., 6:1855--1887, 2005.
- Z. Zhao and H. Liu. Spectral feature selection for supervised and unsupervised learning. In ICML, 2007.
- http://finance.yahoo.com/
Index Terms
- Unsupervised feature selection for principal components analysis