DOI: 10.1145/1581114.1581119
research-article

Efficient computation of PCA with SVD in SQL

Published: 28 June 2009

ABSTRACT

PCA is one of the most common dimensionality reduction techniques, with broad applications in data mining, statistics, and signal processing. In this work we study how to leverage the computing capabilities of a DBMS to solve PCA. We propose a solution that combines a summarization of the data set with the correlation or covariance matrix and then solves PCA with Singular Value Decomposition (SVD). Because the summary matrices can be computed in a single pass, deriving them allows large data sets to be analyzed. Solving SVD in SQL without external libraries proves to be a challenge. We introduce two solutions: one based on SQL queries and a second based on User-Defined Functions. Experimental evaluation shows our method can solve larger problems in less time than external statistical packages.
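
The pipeline described in the abstract (one-pass summarization, then a covariance or correlation matrix, then a decomposition) can be sketched outside the DBMS as follows. This is a minimal illustration, not the paper's SQL or UDF implementation. It assumes the summaries are the count `n`, the linear sums `L`, and the sums of cross-products `Q`; those names, and every function name below, are hypothetical. Because the covariance matrix is symmetric and positive semidefinite, its SVD coincides with its eigendecomposition, so the sketch substitutes a plain cyclic Jacobi eigen-solver for a general SVD routine.

```python
import math

def summarize(rows):
    """Single pass over the data: count n, linear sums L, cross-product sums Q."""
    n, L, Q = 0, None, None
    for x in rows:
        if L is None:
            d = len(x)
            L = [0.0] * d
            Q = [[0.0] * d for _ in range(d)]
        n += 1
        for a in range(d):
            L[a] += x[a]
            for b in range(d):
                Q[a][b] += x[a] * x[b]
    return n, L, Q

def covariance(n, L, Q):
    """Population covariance: Q/n - (L/n)(L/n)^T."""
    d = len(L)
    return [[Q[a][b] / n - (L[a] / n) * (L[b] / n) for b in range(d)]
            for a in range(d)]

def correlation(n, L, Q):
    """Pearson correlation computed directly from the summaries."""
    d = len(L)
    sd = [math.sqrt(n * Q[a][a] - L[a] * L[a]) for a in range(d)]
    return [[(n * Q[a][b] - L[a] * L[b]) / (sd[a] * sd[b]) for b in range(d)]
            for a in range(d)]

def jacobi_eigen(A, sweeps=50, tol=1e-12):
    """Eigenvalues and eigenvectors (columns of V) of a symmetric matrix,
    found with cyclic Jacobi rotations (assumed stand-in for SVD)."""
    d = len(A)
    A = [row[:] for row in A]
    V = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(sweeps):
        off = sum(A[p][q] ** 2 for p in range(d) for q in range(p + 1, d))
        if off < tol:
            break
        for p in range(d):
            for q in range(p + 1, d):
                if A[p][q] == 0.0:
                    continue
                theta = (A[q][q] - A[p][p]) / (2.0 * A[p][q])
                t = math.copysign(1.0, theta) / (abs(theta) + math.sqrt(theta * theta + 1.0))
                c = 1.0 / math.sqrt(t * t + 1.0)
                s = t * c
                for k in range(d):  # columns p, q of A: A <- A J
                    akp, akq = A[k][p], A[k][q]
                    A[k][p], A[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(d):  # rows p, q of A: A <- J^T A
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k], A[q][k] = c * apk - s * aqk, s * apk + c * aqk
                for k in range(d):  # accumulate eigenvectors: V <- V J
                    vkp, vkq = V[k][p], V[k][q]
                    V[k][p], V[k][q] = c * vkp - s * vkq, s * vkp + c * vkq
    return [A[i][i] for i in range(d)], V

def pca(rows, use_correlation=False):
    """Return (variances, components), sorted by decreasing variance.
    Each component is a list of d loadings."""
    n, L, Q = summarize(rows)
    M = correlation(n, L, Q) if use_correlation else covariance(n, L, Q)
    vals, V = jacobi_eigen(M)
    order = sorted(range(len(vals)), key=lambda j: -vals[j])
    return ([vals[j] for j in order],
            [[V[i][j] for i in range(len(vals))] for j in order])

if __name__ == "__main__":
    variances, components = pca([(1, 2), (2, 4), (3, 6), (4, 8)])
    print(variances, components[0])
```

Only `summarize` touches the raw rows; everything after it works on a d x d matrix, which is why the size of the data set affects only the single summarization pass. The sketch divides by `n` (population covariance); a sample covariance would divide by `n - 1` instead.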

Published in

DMMT '09: Proceedings of the 2nd Workshop on Data Mining using Matrices and Tensors
June 2009
52 pages
ISBN: 9781605586731
DOI: 10.1145/1581114
Conference Chairs: Chris Ding, Tao Li

                Copyright © 2009 ACM

Publisher: Association for Computing Machinery, New York, NY, United States
