Article

Random projection in dimensionality reduction: applications to image and text data

Published: 26 August 2001

ABSTRACT

Random projections have recently emerged as a powerful method for dimensionality reduction. Theoretical results indicate that the method preserves distances quite well; however, empirical results are sparse. We present experimental results on using random projection as a dimensionality reduction tool in a number of cases where the high dimensionality of the data would otherwise lead to burdensome computations. Our application areas are the processing of both noisy and noiseless images, and information retrieval in text documents. We show that projecting the data onto a random lower-dimensional subspace yields results comparable to conventional dimensionality reduction methods such as principal component analysis: the similarity of data vectors is preserved well under random projection. However, using random projections is computationally significantly less expensive than using, e.g., principal component analysis. We also show experimentally that using a sparse random matrix gives additional computational savings in random projection.
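The idea the abstract describes can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors' code: the dimensions are arbitrary, and the sparse matrix uses the ±1/0 entries with probabilities 1/6, 2/3, 1/6 (scaled by √(3/k)) in the style of Achlioptas's database-friendly projections, which is one way to realize the "sparse random matrix" the abstract mentions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative sizes: original dimension d, reduced dimension k, n points.
d, k, n = 1000, 100, 50
X = rng.normal(size=(n, d))  # synthetic high-dimensional data

# Dense Gaussian projection matrix, scaled so squared distances are
# preserved in expectation after projection.
R_dense = rng.normal(size=(d, k)) / np.sqrt(k)

# Sparse projection matrix (Achlioptas-style): entries +1, 0, -1 with
# probabilities 1/6, 2/3, 1/6, scaled by sqrt(3/k). Two thirds of the
# entries are zero, which is the source of the computational savings.
R_sparse = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1 / 6, 2 / 3, 1 / 6])
R_sparse *= np.sqrt(3.0 / k)

def pairwise_dists(A):
    """All pairwise Euclidean distances between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

orig = pairwise_dists(X)
mask = orig > 0  # skip zero self-distances on the diagonal

for name, R in [("dense", R_dense), ("sparse", R_sparse)]:
    proj = pairwise_dists(X @ R)
    ratio = proj[mask] / orig[mask]
    print(f"{name}: mean distance ratio {ratio.mean():.3f}, std {ratio.std():.3f}")
```

Both projections produce distance ratios tightly concentrated around 1, which is the empirical version of the distance-preservation claim; the projection itself is a single matrix multiply, in contrast to the eigendecomposition PCA requires.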


Published in

KDD '01: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
August 2001, 493 pages
ISBN: 158113391X
DOI: 10.1145/502512
Copyright © 2001 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


    Acceptance Rates

KDD '01 paper acceptance rate: 31 of 237 submissions (13%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
