Abstract
Since the advent of high-performance distributed-memory parallel computing, the need for intelligible code has become ever greater. The development and maintenance of libraries for these architectures is simply too complex to be amenable to conventional approaches to implementation. Attempts to employ traditional methodology have led, in our opinion, to the production of an abundance of anfractuous code that is difficult to maintain and almost impossible to upgrade.

Having struggled with these issues for more than a decade, we have concluded that a solution is to apply a technique from theoretical computer science, formal derivation, to the development of high-performance linear algebra libraries. We believe the resulting approach yields aesthetically pleasing, coherent code that greatly facilitates intelligent modularity and high performance while enhancing confidence in its correctness. Since the technique is language-independent, it lends itself equally well to a wide spectrum of programming languages (and paradigms), ranging from C and Fortran to C++ and Java.

In this paper, we illustrate our observations by looking at the Formal Linear Algebra Methods Environment (FLAME), a framework that facilitates the derivation and implementation of linear algebra algorithms on sequential architectures. This environment demonstrates that lessons learned in the distributed-memory world can guide us toward better approaches even in the sequential world. We present performance experiments on the Intel(R) Pentium(R) III processor that demonstrate that high performance can be attained by coding at a high level of abstraction.
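To give a sense of what "coding at a high level of abstraction" means here, the following is a minimal sketch (not taken from the paper, and using NumPy rather than the FLAME API) of a right-looking blocked Cholesky factorization written in the partitioned style that FLAME-derived algorithms take: at each iteration the matrix is repartitioned to expose a small diagonal block, and the loop body is a direct transcription of the block updates. The function name and block size are illustrative.

```python
import numpy as np

def chol_blocked(A, nb=2):
    """Right-looking blocked Cholesky factorization, written in a
    FLAME-like partitioned style. At each step we expose the leading
    nb x nb diagonal block A11, the panel A21 below it, and the
    trailing submatrix A22, then apply three block updates."""
    A = A.copy()
    n = A.shape[0]
    k = 0
    while k < n:
        b = min(nb, n - k)
        A21 = A[k+b:, k:k+b]
        A22 = A[k+b:, k+b:]
        # A11 := Chol(A11)  (unblocked/small factorization)
        A[k:k+b, k:k+b] = np.linalg.cholesky(A[k:k+b, k:k+b])
        L11 = A[k:k+b, k:k+b]
        # A21 := A21 * inv(L11)^T  (triangular solve with multiple RHS)
        A[k+b:, k:k+b] = np.linalg.solve(L11, A21.T).T
        L21 = A[k+b:, k:k+b]
        # A22 := A22 - L21 * L21^T  (symmetric rank-b update)
        A[k+b:, k+b:] = A22 - L21 @ L21.T
        k += b
    return np.tril(A)
```

The point of the style is that each line of the loop body corresponds one-to-one with an update in the formally derived algorithm, so the code can be checked against the derivation by inspection; the bulk of the floating-point work lands in the rank-b update, which is a level-3 BLAS operation.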