research-article

Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion

Authors:
Fred Gustavson

IBM T.J. Watson Research Center, Emeritus, and Umeå University

IBM T.J. Watson Research Center, Emeritus, and Umeå University
View Profile

,
Lars Karlsson

Umeå University

Umeå University
View Profile

,
Bo Kågström

Umeå University

Umeå University
View Profile

ACM Transactions on Mathematical Software Volume 38 Issue 3Article No.: 17pp 1–32https://doi.org/10.1145/2168773.2168775

Published:01 April 2012Publication History

ACM Transactions on Mathematical Software

Abstract

Techniques and algorithms for efficient in-place conversion to and from standard and blocked matrix storage formats are described. Such functionality is required by numerical libraries that use different data layouts internally. Parallel algorithms and a software package for in-place matrix storage format conversion based on in-place matrix transposition are presented and evaluated. A new algorithm for in-place transposition which efficiently determines the structure of the transposition permutation a priori is one of the key ingredients. It enables effective load balancing in a parallel environment.

References

Alltop, W. O. 1975. A computer algorithm for transposing nonsquare matrices. IEEE Trans. Comput. 24, 10, 1038--1040. Google ScholarDigital Library
Bader, M. and Zenger, C. 2006. Cache oblivious matrix multiplication using an element ordering based on a Peano curve. Linear Algebra Its Appl. 417, 2--3, 301--313.Google ScholarCross Ref
Bader, M. and Heinecke, A. 2008. Parallel matrix multiplication based on space-filling curves on shared memory multicore platforms. In Proceedings of the Computing Frontiers Conferene and Co-Located Workshops (MAW'08 and WRFT'08). 385--392. Google ScholarDigital Library
Berman, M. F. 1958. A method for transposing a matrix. J. ACM 5, 4, 383--384. Google ScholarDigital Library
Boothroyd, J. 1967. Algorithm 302: Transpose vector stored array. Comm. ACM 10, 5, 292--293. Google ScholarDigital Library
Brebner, M. A. and Laflin, S. 1970. Algorithm 380: In-situ transposition of a rectangular matrix. Comm. ACM 13, 5, 324--326. Google ScholarDigital Library
Brenner, N. 1973. Algorithm 467: Matrix transposition in place. Comm. ACM 16, 11, 692--694. Google ScholarDigital Library
Cate, E. G. and Twigg, D. W. 1977. Algorithm 513: Analysis of in-situ transposition. ACM Trans. Math. Softw. 3, 1, 104--110. Google ScholarDigital Library
Chan, E., van de Geijn, R., Quintana-Ort, E. S., Quintana-Ort, G., and van Zee, F. G. 2008. Programming algorithms-by-blocks for matrix computations on multithreaded architectures. Tech. rep. TR-08-04, Department of Computer Sciences, University of Texas as Austin.Google Scholar
Dongarra, J. and Kurzak, J. 2009. QR factorization for the CELL Broadband Engine. Sci. Progr. 17, 1--2, 31--42. Google ScholarDigital Library
Dow, M. 1995. Transposing a matrix on a vector computer. Parall. Comput. 21, 12, 1997--2005. Google ScholarDigital Library
Elmroth, E., Gustavson, F., Jonsson, I., and Kågström, B. 2004. Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev. 46, 1, 3--45.Google ScholarCross Ref
Fich, F. E., Munro, J. I., and Poblete, P. V. 1995. Permuting in place. SIAM J. Comput. 24, 2, 266--278. Google ScholarDigital Library
Fraser, D. 1976. Array permutation by index-digit permutation. J. ACM 23, 298--309. Google ScholarDigital Library
Gallivan, K., Jalby, W., Meier, U., and Sameh, A. 1988. The impact of hierarchical memory systems on linear algebra algorithm design. Int. J. Supercomput. Appl. 2, 1, 12--48.Google ScholarDigital Library
Gustavson, F. 2000. New generalized matrix data structures lead to a variety of high-performance algorithms. In Proceedings of the IFIP TC2/WG2.5 Working Conference on the Architecture of Scientific Software. Kluwer, B.V., 211--234. Google ScholarDigital Library
Gustavson, F. 2008. The relevance of new data structure approaches for dense linear algebra in the new multicore/manycore environments. Tech. rep. RC24599, IBM Research.Google Scholar
Gustavson, F. and Swirszcz, T. 2007. In-place transposition of rectangular matrices. In Proceedings of the Applied Parallel Computing. State of the Art in Scientific Computing Conference. B. Kågström et al. Eds., Lecture Notes in Computer Science, vol. 4699, Springer, 560--569. Google ScholarDigital Library
Gustavson, F., Karlsson, L., and Kågström, B. 2009. Distributed SBP Cholesky factorization algorithms with near-optimal scheduling. ACM Trans. Math. Softw. 36, 2, 11:1--11:25. Google ScholarDigital Library
Gustavson, F., Henriksson, A., Jonsson, I., Kågström, B., and Ling, P. 1998. Recursive blocked data formats and BLAS's for dense linear algebra algorithms. In Proceedings of the 4th International Workshop on Applied Parallel Computing. Large Scale Scientific and Industrial Problems. Lecture Notes in Computer Science, vol. 1541, 195--206. Google ScholarDigital Library
Hong, B., Park, N., and Prasanna, V. K. 2003. Tiling, block data layout, and memory hierarchy performance. IEEE Trans. Parall. Distrib. Syst. 14, 7, 640--654. Google ScholarDigital Library
Huang, C. H., Johnson, J. R., Johnson, R. W., Kaushik, S. D., and Sadayappan, P. 1993. Efficient transposition algorithms for large matrices. In Proceedings of Supercomputing'93. 656--665. Google ScholarDigital Library
IBM. 1986. IBM Engineering and scientific subroutine library. ESSL Guide and Reference, SA22-7272-00.Google Scholar
Johnson, J. R. 1995. Matrix transposition. Department of Mathematics and Computer Science, Drexel University, Philadelphia, PA 19104. Manuscript. http://cricket.cs.drexel/edu/&tilde;jjohnson/2007-08/fell/cs680/paper/transpose.pdfGoogle Scholar
Karlsson, L. 2009. Blocked in-place transposition with application to storage format conversion. Tech. rep. UMINF 09.01. ISSN 0348-0542, Department of Computing Science, Umeå University, Umeå, Sweden.Google Scholar
Knuth, D. E. 1971. Mathematical analysis of algorithms. In Proceedings of IFIP Congress. North-Holland, 19--27.Google Scholar
Knuth, D. 1998. The Art of Computer Programming. Vols. 1 and 2, 3rd Ed. Addison-Wesley.Google Scholar
Lam, M. D.,Rothberg, E. E., and Wolf, M. E. 1991. The cache performance and optimizations of blocked algorithms. ACM SIGARCH Computer Architect. News 19, 2, 63--74. Google ScholarDigital Library
Pall, G. and Seiden, E. 1960. A problem in Abelian groups, with application to the transposition of a matrix on an electronic computer. Math. Comput. 14, 70, 189--192.Google ScholarCross Ref
Windley, P. F. 1959. Transposing matrices in a digital computer. Comput. J. 2, 1, 47--48.Google ScholarCross Ref

Index Terms

Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion
1. Mathematics of computing
  1. Discrete mathematics
    1. Combinatorics
      1. Permutations and combinations
  2. Mathematical software

Recommendations

Efficient sparse matrix-vector multiplication using cache oblivious extension quadtree storage format

In this paper, we elaborate on improving the sparse matrix storage format to optimize the data locality of sparse matrix-vector multiplication ( S p M V M ) algorithm, and its parallel performance. First of all, we propose a cache oblivious extension ...
Read More
Sparse Matrix Computations Using the Quadtree Storage Format
SYNASC '09: Proceedings of the 2009 11th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing

Computations with sparse matrices are widespread in scientific projects. Used data format affects strongly the performance. Efficient formats for storing sparse matrices are still under development, since the computation using widely-used formats (like ...
Read More
Optimizing Sparse Matrix Vector Multiplication Using Diagonal Storage Matrix Format
HPCC '10: Proceedings of the 2010 IEEE 12th International Conference on High Performance Computing and Communications

Sparse matrix vector multiplication (SpMV) is used in many scientific computations. The main bottleneck of this algorithm is memory bandwidth and many methods reduce memory bandwidth usage by compressing the index array. The matrices from finite ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Mathematical Software Volume 38, Issue 3
April 2012
157 pages
ISSN:0098-3500
EISSN:1557-7295
DOI:10.1145/2168773
Issue’s Table of Contents

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 April 2012
- Accepted: 1 July 2011
- Revised: 1 February 2011
- Received: 1 February 2010
Published in toms Volume 38, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Blocked matrix data layout
in-place matrix transposition
parallel and cache-efficient algorithms
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 52
  Total Citations
  View Citations
- 588
  Total Downloads
- Downloads (Last 12 months)16
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion

ACM Transactions on Mathematical Software

Abstract

References

Cited By

Index Terms

Recommendations

Efficient sparse matrix-vector multiplication using cache oblivious extension quadtree storage format

Sparse Matrix Computations Using the Quadtree Storage Format

Optimizing Sparse Matrix Vector Multiplication Using Diagonal Storage Matrix Format

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Parallel and Cache-Efficient In-Place Matrix Storage Format Conversion

ACM Transactions on Mathematical Software

Abstract

References

Cited By

Index Terms

Recommendations

Efficient sparse matrix-vector multiplication using cache oblivious extension quadtree storage format

Sparse Matrix Computations Using the Quadtree Storage Format

Optimizing Sparse Matrix Vector Multiplication Using Diagonal Storage Matrix Format

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media