
A recursive formulation of Cholesky factorization of a matrix in packed storage

Published: 01 June 2001

Abstract

A new compact way to store a symmetric or triangular matrix, called RPF (Recursive Packed Format), is fully described. Novel ways to transform RPF to and from standard packed format are included. A new algorithm, called RPC for Recursive Packed Cholesky, that operates on the RPF format is presented. Algorithm RPC is based on level-3 BLAS and requires variants of algorithms TRSM and SYRK that work on RPF. We call these RP_TRSM and RP_SYRK and find that they do most of their work by calling GEMM. It follows that most of the execution time of RPC lies in GEMM. The advantage of this storage scheme compared to traditional packed and full storage is demonstrated. First, the RPF storage format uses the minimal amount of storage for the symmetric or triangular matrix. Second, RPC gives a level-3 implementation of Cholesky factorization, whereas standard packed implementations are only level 2; hence the performance of our RPC implementation is decidedly superior. Third, unlike fixed-block-size algorithms, RPC requires no block-size tuning parameter. We present performance measurements on several current architectures that demonstrate improvements over the traditional packed routines. SMP parallel computations on the IBM SMP computer are also reported. The graphs in Section 7 show that the RPC algorithms are superior to the traditional packed algorithms by a factor between 1.6 and 7.4 for orders around 1000, and between 1.9 and 10.3 for orders around 3000. For some architectures, the RPC performance results are almost the same as, or even better than, those of the traditional full-storage algorithms.
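To make the recursion structure concrete, the following NumPy sketch (an editorial illustration, not the authors' Fortran code) shows how RPC, RP_TRSM, and RP_SYRK can be organized. It assumes the lower triangle of an order-n matrix is stored in RPF as three consecutive segments: the recursively packed triangle of A11, the full n2-by-n1 rectangle A21 in column-major order, and the recursively packed triangle of A22, with the split n1 = n // 2. The helper names tri, rp_trsm, rp_syrk, and rpc are all hypothetical.

import numpy as np

def tri(n):
    # Elements in an order-n packed triangle (same count as standard packed storage).
    return n * (n + 1) // 2

def rp_trsm(L, loff, n, B):
    # Sketch of RP_TRSM (hypothetical): solve X * L^T = B in place, where the
    # order-n lower triangular L sits in RPF inside flat array L at offset loff,
    # and B is a full m-by-n NumPy view that is overwritten by X.
    if n == 1:
        B[:, 0] /= L[loff]
        return
    n1 = n // 2
    n2 = n - n1
    b_off = loff + tri(n1)        # start of the L21 rectangle
    t2_off = b_off + n2 * n1      # start of the packed L22 triangle
    L21 = L[b_off:t2_off].reshape(n2, n1, order='F')
    rp_trsm(L, loff, n1, B[:, :n1])     # X1 * L11^T = B1 (recursive)
    B[:, n1:] -= B[:, :n1] @ L21.T      # GEMM update: B2 -= X1 * L21^T
    rp_trsm(L, t2_off, n2, B[:, n1:])   # X2 * L22^T = B2 (recursive)

def rp_syrk(C, coff, n, B):
    # Sketch of RP_SYRK (hypothetical): C := C - B * B^T on the lower triangle
    # of C, order n in RPF at offset coff; B is a full n-by-k NumPy array.
    if n == 1:
        C[coff] -= B[0] @ B[0]
        return
    n1 = n // 2
    n2 = n - n1
    b_off = coff + tri(n1)
    t2_off = b_off + n2 * n1
    C21 = C[b_off:t2_off].reshape(n2, n1, order='F')
    rp_syrk(C, coff, n1, B[:n1])        # C11 -= B1 * B1^T (triangle, recursive)
    C21 -= B[n1:] @ B[:n1].T            # GEMM update: C21 -= B2 * B1^T
    rp_syrk(C, t2_off, n2, B[n1:])      # C22 -= B2 * B2^T (triangle, recursive)

def rpc(A, off, n):
    # Sketch of RPC (hypothetical): in-place Cholesky of an order-n SPD matrix
    # whose lower triangle is stored in RPF inside flat array A at offset off.
    if n == 1:
        A[off] = np.sqrt(A[off])
        return
    n1 = n // 2
    n2 = n - n1
    b_off = off + tri(n1)
    t2_off = b_off + n2 * n1
    A21 = A[b_off:t2_off].reshape(n2, n1, order='F')
    rpc(A, off, n1)                # A11 = L11 * L11^T (recursive)
    rp_trsm(A, off, n1, A21)       # L21 := A21 * L11^{-T}
    rp_syrk(A, t2_off, n2, A21)    # A22 := A22 - L21 * L21^T
    rpc(A, t2_off, n2)             # factor the updated A22 (recursive)

Unlike this sketch, which recurses all the way to order 1, the implementation described in the paper performs the rectangular updates with tuned GEMM, which is where most of the execution time is spent.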




        Reviews

        Maurice W. Benson

Due to symmetry, only the lower (or upper) triangular part of the matrix need be stored in Cholesky factorization. With standard packed storage, the matrix elements of the triangular part are ordered by columns (or rows) and stored in a one-dimensional array. This presents efficiency problems, because the memory hierarchy (CPU, registers, cache, and so on) is used well only if the algorithm exhibits locality of reference (blocking). These considerations motivate a new packed storage arrangement produced by a recursive approach: at the coarsest level, the triangular part of the matrix is partitioned into an approximately square part and two smaller triangular parts. The process continues recursively on the two smaller triangular parts, and the final packing takes the same total space as standard packing. The recursive Cholesky algorithm using this new packed storage achieves superior performance by exploiting level-3 BLAS (Basic Linear Algebra Subprograms). Using well-constructed figures, the paper provides a clear description of the algorithms. A significant part of the presentation involves tests on seven machines over a range of matrix sizes, comparing standard packing, the new formulation, and full-matrix (no packing) algorithms. The new approach is shown to be significantly more effective than the one using standard packing. This work should be of interest to researchers in computational linear algebra, as well as to those developing high-performance software libraries.
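To make the partition the reviewer describes concrete, here is a minimal packer: an editorial sketch in NumPy, not code from the paper, and pack_rpf is a hypothetical name. It splits the lower triangle at n1 = n // 2 into the packed triangle of A11, the column-major rectangle A21, and the packed triangle of A22, recursing on the two triangles.

import numpy as np

def pack_rpf(A):
    # Hypothetical packer: lower triangle of symmetric A into Recursive Packed
    # Format, laid out as [packed A11 | column-major A21 | packed A22].
    n = A.shape[0]
    if n == 1:
        return A[:1, 0].copy()
    n1 = n // 2
    return np.concatenate([
        pack_rpf(A[:n1, :n1]),          # T1: recursively packed A11
        A[n1:, :n1].ravel(order='F'),   # B : full n2-by-n1 rectangle A21
        pack_rpf(A[n1:, n1:]),          # T2: recursively packed A22
    ])

# The packing occupies exactly the space of standard packed storage.
n = 5
M = np.random.randn(n, n)
A = M @ M.T + n * np.eye(n)             # symmetric positive definite test matrix
assert pack_rpf(A).size == n * (n + 1) // 2

Storing the rectangle in column-major (Fortran) order is what allows the approximately square part to be handed directly to GEMM, which is the source of the level-3 performance the review notes.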
