ABSTRACT
PCA is one of the most common dimensionality reduction techniques with broad applications in data mining, statistics and signal processing. In this work we study how to leverage a DBMS computing capabilities to solve PCA. We propose a solution that combines a summarization of the data set with the correlation or covariance matrix and then solve PCA with Singular Value Decomposition (SVD). Deriving the summary matrices allow analyzing large data sets since they can be computed in a single pass. Solving SVD without external libraries proves to be a challenge to compute in SQL. We introduce two solutions: one based in SQL queries and a second one based on User-Defined Functions. Experimental evaluation shows our method can solve larger problems in less time than external statistical packages.
- C. Boutsidis, W. M. Mahoney, and P. Drineas. Unsupervised feature selection for principal components analysis. In KDD '08: Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 61--69, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
- K. Chakrabarti, E. Keogh, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. ACM Trans. Database Syst., 27(2):188--228, 2002. Google ScholarDigital Library
- S. Chitroub, A. Houacine, and B. Sansal. A new pca-based method for data compression and enhancement of multi-frequency polarimetric sar imagery. Intell. Data Anal., 6(2):187--207, 2002. Google ScholarDigital Library
- A. d'Aspremont, F. Bach, and L. Ghaoui. Optimal solutions for sparse principal component analysis. J. Mach. Learn. Res., 9:1269--1294, 2008. Google ScholarDigital Library
- C. Ding and X. He. K-means clustering via principal component analysis. In ICML '04: Proceedings of the twenty-first international conference on Machine learning, page 29, New York, NY, USA, 2004. ACM. Google ScholarDigital Library
- J. J. Gerbrands. On the relationships between svd, klt and pca. Pattern Recognition, 14(1--6):375--381, 1981.Google Scholar
- J. Han and M. Kamber. Data Mining: Concepts and Techniques (The Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann, September 2000. Google ScholarDigital Library
- T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning. Springer, New York, 1st edition, 2001.Google Scholar
- M. Hubert and S. Engelen. Robust pca and classification in biosciences. Bioinformatics, 20(11):1728--1736, 2004. Google ScholarDigital Library
- N. Lawrence. Probabilistic non-linear principal component analysis with gaussian process latent variable models. J. Mach. Learn. Res., 6:1783--1816, 2005. Google ScholarDigital Library
- S. Mosci, L. Rosasco, and A. Verri. Dimensionality reduction and generalization. In ICML '07: Proceedings of the 24th international conference on Machine learning, pages 657--664, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
- V. Nadimpally and M. J. Zaki. A novel approach to determine normal variation in gene expression data. SIGKDD Explor. Newsl., 5(2):6--15, 2003. Google ScholarDigital Library
- C. Ordonez. Horizontal aggregations for building tabular data sets. In DMKD '04: Proceedings of the 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pages 35--42, New York, NY, USA, 2004. ACM. Google ScholarDigital Library
- C. Ordonez. Vertical and horizontal percentage aggregations. In SIGMOD Conference, pages 866--871, 2004. Google ScholarDigital Library
- C. Ordonez. Optimizing recursive queries in sql. In SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, pages 834--839, New York, NY, USA, 2005. ACM. Google ScholarDigital Library
- C. Ordonez. Building statistical models and scoring with udfs. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, pages 1005--1016, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
- C. Ordonez and J. García-García. Vector and matrix operations programmed with udfs in a relational dbms. In CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management, pages 503--512, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
- H. Polat and W. Du. Svd-based collaborative filtering with privacy. In SAC '05: Proceedings of the 2005 ACM symposium on Applied computing, pages 791--795, New York, NY, USA, 2005. ACM. Google ScholarDigital Library
- W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, New York, NY, USA, 2007. Google ScholarDigital Library
- W. Rinsurongkawong and C. Ordonez. Microarray data analysis with pca in a dbms. In DTMBIO '08: Proceeding of the 2nd international workshop on Data and text mining in bioinformatics, pages 13--20, New York, NY, USA, 2008. ACM. Google ScholarDigital Library
- S. Salleh, A. Y Zomaya, and A. B. Sakhinah. Computing for numerical methods using visual C++. Wiley, Hoboken, NJ, 2008. Google ScholarDigital Library
- Polyxeni Zacharouli, Michalis Titsias, and Michalis Vazirgiannis. Web page rank prediction with pca and em clustering. In WAW '09: Proceedings of the 6th International Workshop on Algorithms and Models for the Web-Graph, pages 104--115, Berlin, Heidelberg, 2009. Springer-Verlag. Google ScholarDigital Library
- X. S. Zhuang and D. Q. Dai. Improved discriminate analysis for high-dimensional data and its application to face recognition. Pattern Recogn., 40(5):1570--1578, 2007. Google ScholarDigital Library
Index Terms
- Efficient computation of PCA with SVD in SQL
Recommendations
SVD based initialization: A head start for nonnegative matrix factorization
We describe Nonnegative Double Singular Value Decomposition (NNDSVD), a new method designed to enhance the initialization stage of nonnegative matrix factorization (NMF). NNDSVD can readily be combined with existing NMF algorithms. The basic algorithm ...
Microarray data analysis with PCA in a DBMS
DTMBIO '08: Proceedings of the 2nd international workshop on Data and text mining in bioinformaticsMicroarray data sets contain expression levels of thousands of genes. The statistical analysis of such data sets is typically performed outside a DBMS with statistical packages or mathematical libraries. In this work, we focus on analyzing them inside ...
PCA for large data sets with parallel data summarization
Parallel processing is essential for large-scale analytics. Principal Component Analysis (PCA) is a well known model for dimensionality reduction in statistical analysis, which requires a demanding number of I/O and CPU operations. In this paper, we ...
Comments