ABSTRACT
With massive high-dimensional data now commonplace in research and industry, there is a strong and growing demand for more scalable computational techniques for data analysis and knowledge discovery. Key to turning these data into knowledge is the ability to learn statistical models with high interpretability. Current methods for learning statistical models either produce models that are not interpretable or have prohibitive computational costs when applied to massive data. In this paper we address this need by presenting a scalable algorithm for partial least squares regression (PLS), which we call compression-based PLS (cPLS), to learn predictive linear models with high interpretability from massive high-dimensional data. We propose a novel grammar-compressed representation of data matrices that supports fast row and column access while the data matrix remains in compressed form. The original data matrix is grammar-compressed, and the linear model in PLS is then learned on the compressed data matrix, which results in a significant reduction in working space and greatly improves scalability. We experimentally test cPLS on its ability to learn linear models for classification, regression, and feature extraction on various massive high-dimensional datasets, and show that cPLS is superior in terms of prediction accuracy, computational efficiency, and interpretability.
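The grammar-compression idea behind cPLS can be illustrated with a toy Re-Pair-style sketch: repeatedly replace the most frequent adjacent symbol pair in a row of the data matrix with a fresh nonterminal, recording each replacement as a grammar rule. This is a minimal sketch of the general technique, not the paper's actual cPLS data structure (which additionally supports fast row and column access on the compressed form); all function names below are illustrative.

```python
# Toy Re-Pair-style grammar compression of one data-matrix row.
# Illustrative only; cPLS uses a more elaborate grammar-compressed
# representation with fast row/column access in compressed form.
from collections import Counter

def grammar_compress(seq):
    """Repeatedly replace the most frequent adjacent pair with a
    fresh nonterminal; return (compressed sequence, grammar rules)."""
    rules = {}
    next_symbol = max(seq) + 1  # nonterminals numbered above terminals
    seq = list(seq)
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats: grammar is final
            break
        rules[next_symbol] = pair
        out, i = [], 0
        while i < len(seq):  # greedy left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules

def expand(symbol, rules):
    """Recursively expand a (non)terminal back into terminals."""
    if symbol not in rules:
        return [symbol]
    a, b = rules[symbol]
    return expand(a, rules) + expand(b, rules)

row = [1, 2, 1, 2, 3, 1, 2, 1, 2, 3]
compressed, rules = grammar_compress(row)
restored = [t for s in compressed for t in expand(s, rules)]
assert restored == row          # decompression is lossless
assert len(compressed) < len(row)  # repeated structure is factored out
```

In the full algorithm, the PLS iterations operate directly on such grammar rules rather than on the expanded rows, which is where the working-space reduction comes from.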
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices