DOI: 10.1145/2939672.2939864
Research article · Open Access

Scalable Partial Least Squares Regression on Grammar-Compressed Data Matrices

Published: 13 August 2016

ABSTRACT

With massive high-dimensional data now commonplace in research and industry, there is a strong and growing demand for more scalable computational techniques for data analysis and knowledge discovery. Key to turning these data into knowledge is the ability to learn statistical models with high interpretability. Current methods for learning statistical models either produce models that are not interpretable or incur prohibitive computational costs when applied to massive data. In this paper we address this need by presenting a scalable algorithm for partial least squares regression (PLS), which we call compression-based PLS (cPLS), to learn highly interpretable predictive linear models from massive high-dimensional data. We propose a novel grammar-compressed representation of data matrices that supports fast row and column access while the data matrix remains in compressed form. The original data matrix is grammar-compressed, and the linear model in PLS is then learned on the compressed data matrix, which significantly reduces working space and greatly improves scalability. We experimentally test cPLS on its ability to learn linear models for classification, regression, and feature extraction on various massive high-dimensional datasets, and show that cPLS performs strongly in terms of prediction accuracy, computational efficiency, and interpretability.
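The core idea described above, grammar-compressing the data and then computing on the compressed form, can be illustrated with a toy Re-Pair-style compressor. This is a minimal sketch for intuition only, not the authors' cPLS representation (which additionally supports fast row and column access directly on the compressed matrix); all function and variable names here are hypothetical:

```python
# Toy Re-Pair-style grammar compression of one data-matrix row:
# repeatedly replace the most frequent adjacent symbol pair with a
# fresh nonterminal, recording one grammar rule per replacement.
from collections import Counter

def repair_compress(seq):
    """Return (compressed sequence, grammar rules)."""
    rules = {}
    next_sym = max(seq) + 1 if seq else 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:          # no pair repeats; grammar is final
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):   # left-to-right replacement pass
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
        next_sym += 1
    return seq, rules

def expand(sym, rules):
    """Decompress one symbol back to the original terminals."""
    if sym not in rules:
        return [sym]
    a, b = rules[sym]
    return expand(a, rules) + expand(b, rules)

row = [1, 2, 1, 2, 1, 2, 3, 1, 2]
comp, rules = repair_compress(row)
restored = [t for s in comp for t in expand(s, rules)]
assert restored == row   # lossless round trip
```

Repeated patterns in the row collapse into shared rules, so highly redundant matrices shrink substantially; learning the PLS model directly on such a representation is what yields the working-space reduction claimed in the abstract.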


Published in

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016, 2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672
Copyright © 2016 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance Rates

KDD '16 paper acceptance rate: 66 of 1,115 submissions (6%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
