Abstract
Techniques and algorithms for efficient in-place conversion to and from standard and blocked matrix storage formats are described. Such functionality is required by numerical libraries that use different data layouts internally. Parallel algorithms and a software package for in-place matrix storage format conversion based on in-place matrix transposition are presented and evaluated. A new algorithm for in-place transposition which efficiently determines the structure of the transposition permutation a priori is one of the key ingredients. It enables effective load balancing in a parallel environment.
- Alltop, W. O. 1975. A computer algorithm for transposing nonsquare matrices. IEEE Trans. Comput. 24, 10, 1038--1040. Google ScholarDigital Library
- Bader, M. and Zenger, C. 2006. Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Its Appl. 417, 2--3, 301--313.Google ScholarCross Ref
- Bader, M. and Heinecke, A. 2008. Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms. In Proceedings of the Computing Frontiers Conferene and Co-Located Workshops (MAW'08 and WRFT'08). 385--392. Google ScholarDigital Library
- Berman, M. F. 1958. A method for transposing a matrix. J. ACM 5, 4, 383--384. Google ScholarDigital Library
- Boothroyd, J. 1967. Algorithm 302: Transpose vector stored array. Comm. ACM 10, 5, 292--293. Google ScholarDigital Library
- Brebner, M. A. and Laflin, S. 1970. Algorithm 380: In-situ transposition of a rectangular matrix. Comm. ACM 13, 5, 324--326. Google ScholarDigital Library
- Brenner, N. 1973. Algorithm 467: Matrix transposition in place. Comm. ACM 16, 11, 692--694. Google ScholarDigital Library
- Cate, E. G. and Twigg, D. W. 1977. Algorithm 513: Analysis of in-situ transposition. ACM Trans. Math. Softw. 3, 1, 104--110. Google ScholarDigital Library
- Chan, E., van de Geijn, R., Quintana-Ort, E. S., Quintana-Ort, G., and van Zee, F. G. 2008. Programming algorithms-by-blocks for matrix computations on multithreaded architectures. Tech. rep. TR-08-04, Department of Computer Sciences, University of Texas as Austin.Google Scholar
- Dongarra, J. and Kurzak, J. 2009. QR factorization for the CELL Broadband Engine. Sci. Progr. 17, 1--2, 31--42. Google ScholarDigital Library
- Dow, M. 1995. Transposing a matrix on a vector computer. Parall. Comput. 21, 12, 1997--2005. Google ScholarDigital Library
- Elmroth, E., Gustavson, F., Jonsson, I., and Kågström, B. 2004. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev. 46, 1, 3--45.Google ScholarCross Ref
- Fich, F. E., Munro, J. I., and Poblete, P. V. 1995. Permuting in place. SIAM J. Comput. 24, 2, 266--278. Google ScholarDigital Library
- Fraser, D. 1976. Array permutation by index-digit permutation. J. ACM 23, 298--309. Google ScholarDigital Library
- Gallivan, K., Jalby, W., Meier, U., and Sameh, A. 1988. The impact of hierarchical memory systems on linear algebra algorithm design. Int. J. Supercomput. Appl. 2, 1, 12--48.Google ScholarDigital Library
- Gustavson, F. 2000. New generalized matrix data structures lead to a variety of high-performance algorithms. In Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software. Kluwer, B.V., 211--234. Google ScholarDigital Library
- Gustavson, F. 2008. The relevance of new data structure approaches for dense linear algebra in the new multicore/manycore environments. Tech. rep. RC24599, IBM Research.Google Scholar
- Gustavson, F. and Swirszcz, T. 2007. In-place transposition of rectangular matrices. In Proceedings of the Applied Parallel Computing. State of the Art in Scientific Computing Conference. B. Kågström et al. Eds., Lecture Notes in Computer Science, vol. 4699, Springer, 560--569. Google ScholarDigital Library
- Gustavson, F., Karlsson, L., and Kågström, B. 2009. Distributed SBP Cholesky factorization algorithms with near-optimal scheduling. ACM Trans. Math. Softw. 36, 2, 11:1--11:25. Google ScholarDigital Library
- Gustavson, F., Henriksson, A., Jonsson, I., Kågström, B., and Ling, P. 1998. Recursive blocked data formats and BLAS's for dense linear algebra algorithms. In Proceedings of the 4th International Workshop on Applied Parallel Computing. Large Scale Scientific and Industrial Problems. Lecture Notes in Computer Science, vol. 1541, 195--206. Google ScholarDigital Library
- Hong, B., Park, N., and Prasanna, V. K. 2003. Tiling, block data layout, and memory hierarchy performance. IEEE Trans. Parall. Distrib. Syst. 14, 7, 640--654. Google ScholarDigital Library
- Huang, C. H., Johnson, J. R., Johnson, R. W., Kaushik, S. D., and Sadayappan, P. 1993. Efficient transposition algorithms for large matrices. In Proceedings of Supercomputing'93. 656--665. Google ScholarDigital Library
- IBM. 1986. IBM Engineering and scientific subroutine library. ESSL Guide and Reference, SA22-7272-00.Google Scholar
- Johnson, J. R. 1995. Matrix transposition. Department of Mathematics and Computer Science, Drexel University, Philadelphia, PA 19104. Manuscript. http://cricket.cs.drexel/edu/˜jjohnson/2007-08/fell/cs680/paper/transpose.pdfGoogle Scholar
- Karlsson, L. 2009. Blocked in-place transposition with application to storage format conversion. Tech. rep. UMINF 09.01. ISSN 0348-0542, Department of Computing Science, Umeå University, Umeå, Sweden.Google Scholar
- Knuth, D. E. 1971. Mathematical analysis of algorithms. In Proceedings of IFIP Congress. North-Holland, 19--27.Google Scholar
- Knuth, D. 1998. The Art of Computer Programming. Vols. 1 and 2, 3rd Ed. Addison-Wesley.Google Scholar
- Lam, M. D.,Rothberg, E. E., and Wolf, M. E. 1991. The cache performance and optimizations of blocked algorithms. ACM SIGARCH Computer Architect. News 19, 2, 63--74. Google ScholarDigital Library
- Pall, G. and Seiden, E. 1960. A problem in Abelian groups, with application to the transposition of a matrix on an electronic computer. Math. Comput. 14, 70, 189--192.Google ScholarCross Ref
- Windley, P. F. 1959. Transposing matrices in a digital computer. Comput. J. 2, 1, 47--48.Google ScholarCross Ref
Index Terms
- Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion
Recommendations
Efficient sparse matrix-vector multiplication using cache oblivious extension quadtree storage format
In this paper, we elaborate on improving the sparse matrix storage format to optimize the data locality of sparse matrix-vector multiplication ( S p M V M ) algorithm, and its parallel performance. First of all, we propose a cache oblivious extension ...
Sparse Matrix Computations Using the Quadtree Storage Format
SYNASC '09: Proceedings of the 2009 11th International Symposium on Symbolic and Numeric Algorithms for Scientific ComputingComputations with sparse matrices are widespread in scientific projects. Used data format affects strongly the performance. Efficient formats for storing sparse matrices are still under development, since the computation using widely-used formats (like ...
Optimizing Sparse Matrix Vector Multiplication Using Diagonal Storage Matrix Format
HPCC '10: Proceedings of the 2010 IEEE 12th International Conference on High Performance Computing and CommunicationsSparse matrix vector multiplication (SpMV) is used in many scientific computations. The main bottleneck of this algorithm is memory bandwidth and many methods reduce memory bandwidth usage by compressing the index array. The matrices from finite ...
Comments