ABSTRACT
With massive high-dimensional data now commonplace in research and industry, there is a strong and growing demand for more scalable computational techniques for data analysis and knowledge discovery. Key to turning these data into knowledge is the ability to learn statistical models with high interpretability. Current methods for learning statistical models either produce models that are not interpretable or have prohibitive computational costs when applied to massive data. In this paper we address this need by presenting a scalable algorithm for partial least squares regression (PLS), which we call compression-based PLS (cPLS), to learn predictive linear models with high interpretability from massive high-dimensional data. We propose a novel grammar-compressed representation of data matrices that supports fast row and column access while the data matrix remains in compressed form. The original data matrix is grammar-compressed, and the linear model in PLS is then learned on the compressed data matrix, which results in a significant reduction in working space and greatly improves scalability. We experimentally test cPLS on its ability to learn linear models for classification, regression, and feature extraction on various massive high-dimensional datasets, and show that cPLS is superior in terms of prediction accuracy, computational efficiency, and interpretability.
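The grammar-compression idea behind cPLS can be illustrated with a toy Re-Pair-style sketch: repeatedly replace the most frequent adjacent symbol pair in a row of the data matrix with a fresh nonterminal, recording each replacement as a grammar rule. This is a minimal sketch of the general technique, not the paper's actual cPLS data structure (which additionally supports fast row and column access on the compressed form); all function names below are illustrative.

```python
# Toy Re-Pair-style grammar compression of one data-matrix row.
# Illustrative only; cPLS uses a more elaborate grammar-compressed
# representation with fast row/column access in compressed form.
from collections import Counter

def grammar_compress(seq):
    """Repeatedly replace the most frequent adjacent pair with a
    fresh nonterminal; return (compressed sequence, grammar rules)."""
    rules = {}
    next_symbol = max(seq) + 1  # nonterminals numbered above terminals
    seq = list(seq)
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:  # no pair repeats: grammar is final
            break
        rules[next_symbol] = pair
        out, i = [], 0
        while i < len(seq):  # greedy left-to-right replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_symbol += 1
    return seq, rules

def expand(symbol, rules):
    """Recursively expand a (non)terminal back into terminals."""
    if symbol not in rules:
        return [symbol]
    a, b = rules[symbol]
    return expand(a, rules) + expand(b, rules)

row = [1, 2, 1, 2, 3, 1, 2, 1, 2, 3]
compressed, rules = grammar_compress(row)
restored = [t for s in compressed for t in expand(s, rules)]
assert restored == row          # decompression is lossless
assert len(compressed) < len(row)  # repeated structure is factored out
```

In the full algorithm, the PLS iterations operate directly on such grammar rules rather than on the expanded rows, which is where the working-space reduction comes from.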
Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices