
Unsupervised feature selection for principal components analysis

Published: 24 August 2008

ABSTRACT

Principal Components Analysis (PCA) is the predominant linear dimensionality reduction technique, and it has been widely applied to datasets across all scientific domains. We consider, both theoretically and empirically, the problem of unsupervised feature selection for PCA by leveraging algorithms for the so-called Column Subset Selection Problem (CSSP). In words, the CSSP seeks the "best" subset of exactly k columns from an m × n data matrix A, and it has been extensively studied in the Numerical Linear Algebra community. We present a novel two-stage algorithm for the CSSP. From a theoretical perspective, for small to moderate values of k, this algorithm significantly improves upon the best previously existing results [24, 12] for the CSSP. From an empirical perspective, we evaluate this algorithm as an unsupervised feature selection strategy in three application domains of modern statistical data analysis: finance, document-term data, and genetics. We pay particular attention to how this algorithm may be used to select representative or landmark features from an object-feature matrix in an unsupervised manner. In all three application domains, we are able to identify k landmark features, i.e., columns of the data matrix, that capture nearly the same amount of information as does the subspace spanned by the top k "eigenfeatures."
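
The abstract does not spell out the two stages of the algorithm, so the Python/NumPy sketch below is only a hypothetical illustration of a two-stage CSSP heuristic in the spirit described: a randomized stage that oversamples columns with probabilities derived from the leverage scores of the top k right singular vectors, followed by a deterministic rank-revealing (pivoted) QR stage that keeps exactly k of the sampled columns. The function names, the oversampling size c, and the sampling details are illustrative assumptions, not the paper's exact construction.

    import numpy as np
    from scipy.linalg import qr

    def two_stage_cssp(A, k, c=None, seed=None):
        """Pick exactly k columns of the m x n matrix A in two stages.
        Stage 1 randomly oversamples ~k log k columns using leverage-score
        probabilities; stage 2 runs pivoted QR on the sampled columns and
        keeps the first k pivots. (Illustrative sketch, not the paper's
        exact algorithm.)"""
        rng = np.random.default_rng(seed)
        m, n = A.shape
        if c is None:
            c = min(n, 4 * k + int(4 * k * np.log(k + 1)))  # assumed oversampling

        # Stage 1 (randomized): leverage scores of the columns with respect
        # to the top-k right singular subspace define a sampling distribution.
        _, _, Vt = np.linalg.svd(A, full_matrices=False)
        lev = np.sum(Vt[:k, :] ** 2, axis=0)  # column leverage scores; sum to k
        sampled = rng.choice(n, size=c, replace=False, p=lev / lev.sum())

        # Stage 2 (deterministic): rank-revealing pivoted QR on the sampled
        # columns; the first k pivots are the selected landmark features.
        _, _, piv = qr(A[:, sampled], pivoting=True)
        return sampled[piv[:k]]

    def residual_ratio(A, cols, k):
        """|| A - P_C A ||_F / || A - A_k ||_F, where P_C projects onto the
        span of the chosen columns and A_k is the best rank-k approximation
        of A from the SVD."""
        C = A[:, cols]
        resid = np.linalg.norm(A - C @ np.linalg.pinv(C) @ A, "fro")
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        opt = np.linalg.norm(A - (U[:, :k] * s[:k]) @ Vt[:k, :], "fro")
        return resid / opt

    # Example on synthetic low-rank-plus-noise data: the ratio should stay
    # close to 1, i.e., k actual columns capture nearly as much information
    # as the top-k eigenfeatures.
    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))
        A += 0.01 * rng.standard_normal((200, 100))
        cols = two_stage_cssp(A, k=5, seed=0)
        print(cols, residual_ratio(A, cols, k=5))

Here residual_ratio mirrors the comparison made in the abstract: it measures the Frobenius-norm residual after projecting A onto the span of the k selected columns against the residual of the best rank-k approximation, so a ratio near 1 means the k landmark features capture nearly the same amount of information as the subspace spanned by the top k eigenfeatures.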

References

  1. A. d'Aspremont, L. El Ghaoui, M. I. Jordan, and G. R. G. Lanckriet. A direct formulation for sparse PCA using semidefinite programming. SIAM Review, 49(3), 2007.
  2. A. Ben-Hur and I. Guyon. Detecting stable clusters using principal component analysis. Methods Mol Biol, 224:159--182, 2003.
  3. M. W. Berry, S. A. Pulatova, and G. W. Stewart. Computing sparse reduced-rank approximations to sparse matrices. ACM Trans on Math Soft, 31:252--269, 2005.
  4. C. H. Bischof and G. Quintana-Ortí. Algorithm 782: codes for rank-revealing QR factorizations of dense matrices. ACM Trans on Math Soft, 24:254--257, 1998.
  5. C. H. Bischof and G. Quintana-Ortí. Computing rank-revealing QR factorizations of dense matrices. ACM Trans on Math Soft, 24(2):226--253, 1998.
  6. C. Boutsidis, M. W. Mahoney, and P. Drineas. Manuscript in preparation, 2008.
  7. J. Cadima, J. O. Cerdeira, and M. Minhoto. Computational aspects of algorithms for variable selection in the context of principal components. Computational Statistics & Data Analysis, 47(2):225--236, 2004.
  8. T. F. Chan. Rank revealing QR factorizations. Linear Algebra Appl, 88/89:67--82, 1987.
  9. T. F. Chan and P. C. Hansen. Some applications of the rank revealing QR factorization. SIAM J Sci and Stat Comp, 13:727--741, 1992.
  10. T. F. Chan and P. C. Hansen. Low-rank revealing QR factorizations. Linear Algebra Appl, 1:33--44, 1994.
  11. S. Chandrasekaran and I. C. F. Ipsen. On rank-revealing factorizations. SIAM J Matrix Anal Appl, 15:592--622, 1994.
  12. A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix approximation and projective clustering via volume sampling. In SODA, 2006.
  13. M. Devaney and A. Ram. Efficient feature selection in conceptual clustering. In ICML, 1997.
  14. P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Subspace sampling and relative-error matrix approximation: column-based methods. In APPROX-RANDOM, 2006.
  15. P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. http://arxiv.org/abs/0708.3696, 2007.
  16. J. G. Dy and C. E. Brodley. Feature selection for unsupervised learning. J. Mach. Learn. Res., 5:845--889, 2004.
  17. R. D. Fierro, P. C. Hansen, and P. Hansen. UTV tools: Matlab templates for rank-revealing UTV decompositions. Numerical Algorithms, 20(2-3):165--194, 1999.
  18. L. V. Foster. Rank and null space calculations using matrix decomposition without column interchanges. Linear Algebra Appl, 74:47--71, 1986.
  19. L. V. Foster and X. Liu. Comparison of rank revealing algorithms applied to matrices with well-defined numerical ranks. Manuscript, 2006.
  20. A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In FOCS, 1998.
  21. E. Gabrilovich and S. Markovitch. Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5. In ICML, 2004.
  22. G. H. Golub. Numerical methods for solving linear least squares problems. Numer Math, 7:206--216, 1965.
  23. G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 1989.
  24. M. Gu and S. C. Eisenstat. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM J Sci Comp, 17:848--869, 1996.
  25. Y. P. Hong and C. T. Pan. Rank-revealing QR factorizations and the singular value decomposition. Math Comp, 58:213--232, 1992.
  26. W. J. Krzanowski. Selection of variables to preserve multivariate data structure, using principal components. Applied Statistics, 36(1):22--33, 1987.
  27. F. G. Kuruvilla, P. J. Park, and S. L. Schreiber. Vector algebra in the analysis of genome-wide expression data. Genome Biology, 3, 2002.
  28. K. Z. Mao. Identifying critical variables of principal components for unsupervised feature selection. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 35(2):339--344, 2005.
  29. M. W. Mahoney, M. Maggioni, and P. Drineas. Tensor-CUR decompositions for tensor-based data. In KDD, 2006.
  30. P. Menozzi, A. Piazza, and L. Cavalli-Sforza. Synthetic maps of human gene frequencies in Europeans. Science, 201(4358):786--792, 1978.
  31. P. Mitra, C. A. Murthy, and S. K. Pal. Unsupervised feature selection using feature similarity. IEEE Trans. Pattern Anal. Mach. Intell., 24(3):301--312, 2002.
  32. Open Directory Project. http://www.dmoz.org/
  33. C. T. Pan. On the existence and computation of rank-revealing LU factorizations. Linear Algebra Appl, 316:199--222, 2000.
  34. C. T. Pan and P. T. P. Tang. Bounds on singular values revealed by QR factorizations. BIT Numerical Mathematics, 39:740--756, 1999.
  35. P. Paschou, E. Ziv, E. G. Burchard, S. Choudhry, W. R. Cintron, M. W. Mahoney, and P. Drineas. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genetics, 3(9), 2007.
  36. P. Paschou, M. W. Mahoney, A. Javed, J. R. Kidd, A. J. Pakstis, S. Gu, K. K. Kidd, and P. Drineas. Intra- and inter-population genotype reconstruction from tagging SNPs. Genome Research, 17:96--107, 2007.
  37. N. A. Rosenberg, L. M. Li, R. Ward, and J. K. Pritchard. Informativeness of genetic markers for inference of ancestry. Am J Hum Genet, 73(6):1402--1422, 2003.
  38. G. W. Stewart. Four algorithms for the efficient computation of truncated QR approximations to a sparse matrix. Num Math, 83:313--323, 1999.
  39. H. Stoppiglia, G. Dreyfus, R. Dubois, and Y. Oussar. Ranking a random feature for variable and feature selection. J. Mach. Learn. Res., 3:1399--1414, 2003.
  40. J. Sun, Y. Xie, H. Zhang, and C. Faloutsos. Less is more: compact matrix decomposition for large sparse graphs. In SDM, 2007.
  41. The International HapMap Consortium. A haplotype map of the human genome. Nature, 437:1299--1320, 2005.
  42. L. Wolf and A. Shashua. Feature selection for unsupervised and supervised inference: the emergence of sparsity in a weight-based approach. J. Mach. Learn. Res., 6:1855--1887, 2005.
  43. Z. Zhao and H. Liu. Spectral feature selection for supervised and unsupervised learning. In ICML, 2007.
  44. Yahoo! Finance. http://finance.yahoo.com/

Published in

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2008, 1116 pages. ISBN: 9781605581934. DOI: 10.1145/1401890. General Chair: Ying Li. Program Chairs: Bing Liu, Sunita Sarawagi. Copyright © 2008 ACM.
