research-article

Efficient computation of PCA with SVD in SQL

Authors:
Mario Navas

University of Houston, Houston, TX

University of Houston, Houston, TX
View Profile

,
Carlos Ordonez

University of Houston, Houston, TX

University of Houston, Houston, TX
View Profile

DMMT '09: Proceedings of the 2nd Workshop on Data Mining using Matrices and TensorsJune 2009Article No.: 5Pages 1–10https://doi.org/10.1145/1581114.1581119

Published:28 June 2009Publication History

DMMT '09: Proceedings of the 2nd Workshop on Data Mining using Matrices and Tensors

Pages 1–10

ABSTRACT

PCA is one of the most common dimensionality reduction techniques with broad applications in data mining, statistics and signal processing. In this work we study how to leverage a DBMS computing capabilities to solve PCA. We propose a solution that combines a summarization of the data set with the correlation or covariance matrix and then solve PCA with Singular Value Decomposition (SVD). Deriving the summary matrices allow analyzing large data sets since they can be computed in a single pass. Solving SVD without external libraries proves to be a challenge to compute in SQL. We introduce two solutions: one based in SQL queries and a second one based on User-Defined Functions. Experimental evaluation shows our method can solve larger problems in less time than external statistical packages.

References

C. Boutsidis, W. M. Mahoney, and P. Drineas. Unsupervised feature selection for principal components analysis. In KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 61--69, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
K. Chakrabarti, E. Keogh, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans. Database Syst., 27(2):188--228, 2002. Google ScholarDigital Library
S. Chitroub, A. Houacine, and B. Sansal. A new pca-based method for data compression and enhancement of multi-frequency polarimetric sar imagery. Intell. Data Anal., 6(2):187--207, 2002. Google ScholarDigital Library
A. d'Aspremont, F. Bach, and L. Ghaoui. Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res., 9:1269--1294, 2008. Google ScholarDigital Library
C. Ding and X. He. K-means clustering via principal component analysis. In ICML '04: Proceedings of the twenty-first international conference on Machine learning, page 29, New York, NY, USA, 2004. ACM. Google ScholarDigital Library
J. J. Gerbrands. On the relationships between svd, klt and pca. Pattern Recognition, 14(1--6):375--381, 1981.Google Scholar
J. Han and M. Kamber. Data Mining: Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann, September 2000. Google ScholarDigital Library
T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, New York, 1st edition, 2001.Google Scholar
M. Hubert and S. Engelen. Robust pca and classification in biosciences. Bioinformatics, 20(11):1728--1736, 2004. Google ScholarDigital Library
N. Lawrence. Probabilistic non-linear principal component analysis with gaussian process latent variable models. J. Mach. Learn. Res., 6:1783--1816, 2005. Google ScholarDigital Library
S. Mosci, L. Rosasco, and A. Verri. Dimensionality reduction and generalization. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 657--664, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
V. Nadimpally and M. J. Zaki. A novel approach to determine normal variation in gene expression data. SIGKDD Explor. Newsl., 5(2):6--15, 2003. Google ScholarDigital Library
C. Ordonez. Horizontal aggregations for building tabular data sets. In DMKD '04: Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pages 35--42, New York, NY, USA, 2004. ACM. Google ScholarDigital Library
C. Ordonez. Vertical and horizontal percentage aggregations. In SIGMOD Conference, pages 866--871, 2004. Google ScholarDigital Library
C. Ordonez. Optimizing recursive queries in sql. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 834--839, New York, NY, USA, 2005. ACM. Google ScholarDigital Library
C. Ordonez. Building statistical models and scoring with udfs. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1005--1016, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
C. Ordonez and J. Garc&#237;a-Garc&#237;a. Vector and matrix operations programmed with udfs in a relational dbms. In CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management, pages 503--512, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
H. Polat and W. Du. Svd-based collaborative filtering with privacy. In SAC '05: Proceedings of the 2005 ACM symposium on Applied computing, pages 791--795, New York, NY, USA, 2005. ACM. Google ScholarDigital Library
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, New York, NY, USA, 2007. Google ScholarDigital Library
W. Rinsurongkawong and C. Ordonez. Microarray data analysis with pca in a dbms. In DTMBIO '08: Proceeding of the 2nd international workshop on Data and text mining in bioinformatics, pages 13--20, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
S. Salleh, A. Y Zomaya, and A. B. Sakhinah. Computing for numerical methods using visual C++. Wiley, Hoboken, NJ, 2008. Google ScholarDigital Library
Polyxeni Zacharouli, Michalis Titsias, and Michalis Vazirgiannis. Web page rank prediction with pca and em clustering. In WAW '09: Proceedings of the 6th International Workshop on Algorithms and Models for the Web-Graph, pages 104--115, Berlin, Heidelberg, 2009. Springer-Verlag. Google ScholarDigital Library
X. S. Zhuang and D. Q. Dai. Improved discriminate analysis for high-dimensional data and its application to face recognition. Pattern Recogn., 40(5):1570--1578, 2007. Google ScholarDigital Library

Index Terms

Efficient computation of PCA with SVD in SQL

Recommendations

SVD based initialization: A head start for nonnegative matrix factorization

We describe Nonnegative Double Singular Value Decomposition (NNDSVD), a new method designed to enhance the initialization stage of nonnegative matrix factorization (NMF). NNDSVD can readily be combined with existing NMF algorithms. The basic algorithm ...
Read More
Microarray data analysis with PCA in a DBMS
DTMBIO '08: Proceedings of the 2nd international workshop on Data and text mining in bioinformatics

Microarray data sets contain expression levels of thousands of genes. The statistical analysis of such data sets is typically performed outside a DBMS with statistical packages or mathematical libraries. In this work, we focus on analyzing them inside ...
Read More
PCA for large data sets with parallel data summarization

Parallel processing is essential for large-scale analytics. Principal Component Analysis (PCA) is a well known model for dimensionality reduction in statistical analysis, which requires a demanding number of I/O and CPU operations. In this paper, we ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
DMMT '09: Proceedings of the 2nd Workshop on Data Mining using Matrices and Tensors
June 2009
52 pages
ISBN:9781605586731
DOI:10.1145/1581114
Conference Chairs:
Chris Ding
University of Texas at Arlington
,
Tao Li
Florida International University
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 28 June 2009
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
DBMS
PCA
SQL
SVD
Qualifiers
- research-article
Conference
Upcoming Conference
KDD '24

Sponsor:

sigkdd

sigkdd

The 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining

August 25 - 29, 2024

Barcelona , Spain
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 15
  Total Citations
  View Citations
- 603
  Total Downloads
- Downloads (Last 12 months)13
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient computation of PCA with SVD in SQL

DMMT '09: Proceedings of the 2nd Workshop on Data Mining using Matrices and Tensors

ABSTRACT

References

Cited By

Index Terms

Recommendations

SVD based initialization: A head start for nonnegative matrix factorization

Microarray data analysis with PCA in a DBMS

PCA for large data sets with parallel data summarization