
Parallel Computing

Volume 56, August 2016, Pages 1-17

Fast Matlab compatible sparse assembly on multicore computers

https://doi.org/10.1016/j.parco.2016.04.001

Highlights

  • A novel algorithm for fast general assembly of sparse matrices is proposed.

  • Modifications for efficient parallelization on multicore platforms are detailed.

  • An open source implementation and benchmark tests demonstrate notable performance gains.

  • A Matlab interface and scripts are provided for full reproducibility.

Abstract

We develop and implement in this paper a fast sparse assembly algorithm, the fundamental operation that creates a compressed matrix from raw index data. Since it is often a quite demanding and sometimes critical operation, it is of interest to design a highly efficient implementation. We show how to do this, and moreover, we show how our implementation can be parallelized to utilize the power of modern multicore computers. Our freely available code, fully Matlab compatible, achieves a speedup of about 5× on a typical 6-core machine and about 10× on a dual-socket 16-core machine compared to the built-in serial implementation.

Introduction

The popular Matlab programming environment was originally built around the insight that most computing applications in some way or other rely on the storage and manipulation of one fundamental object: the matrix. In the early 90s an important update added support for a sparse storage format, as presented in [7]. That paper describes how sparse matrices are managed in an otherwise dense-matrix environment, including the initial creation of a sparse matrix, some basic manipulations and operations, and fundamental matrix factorizations in sparse format.

As a guiding principle the authors formulate the “time is proportional to flops” rule [7, p. 334]:

The time required for a sparse matrix operation should be proportional to the number of arithmetic operations on nonzero quantities.

The situation is somewhat different today, since flops can often be considered “free” while memory transfers are, in most cases, the real bottleneck of a program. With the multicore era here to stay, programs need to be threaded in order to utilize all hardware resources efficiently. This is a non-trivial task and requires careful design [1].

In this paper we consider a single sparse operation, namely the initial assembly operation as implemented by the Matlab function sparse:

  S = sparse(i,j,s);

After the call, S contains a sparse representation of the matrix defined by S(i(k),j(k)) = s(k) for k ranging over the indices of the vectors {i, j, s}, where repeated index pairs imply that the corresponding elements are to be summed together. Many applications naturally lead to substantial repetitions of indices, and the implied reduction must be detected and handled efficiently. For example, in the important case of assembly in linear finite element methods for partial differential equations, the resulting sparse matrix has a sparsity pattern which is identical to that of the matrix representation of the underlying triangular or tetrahedral mesh when viewed as a graph. The number of collisions during the assembly then corresponds exactly to the connectivity of the nodes in this graph.
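As a small concrete illustration (our own example, not taken from the paper), the following Matlab snippet shows the reduction of repeated index pairs: the two triplets sharing the index pair (2,1) are summed into a single entry.

  i = [1 2 2 3];
  j = [1 1 1 3];
  s = [10 20 5 7];
  S = sparse(i,j,s,3,3);
  full(S)
  % ans =
  %     10     0     0
  %     25     0     0
  %      0     0     7

The contributions 20 and 5 at position (2,1) have been summed to 25, which is exactly the kind of reduction a finite element assembly performs when several elements contribute to the same matrix entry.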

Since the assembly must be performed before any other matrix operations are executed, its performance may become a bottleneck. The typical example is dynamic nonlinear partial differential equations (PDEs), where re-assembly occurs many times as a numerical time integration proceeds, including during the iterations of the nonlinear solver. Thus, the assembly process is a quite time-consuming operation that is performed repeatedly, and its cost cannot always be amortized over subsequent operations. Notably, in the truly large case presented in [9, Sections 5.1.2–5.1.3], the performance of the sparse assembly is found to be the reason behind the loss of strong scaling beyond a few thousand cores.

Algorithms for sparse assembly have also caught the attention of others. General assembly via an intermediate hashed data format is considered in [2], where serial performance experiments in the PETSc library are also reported. As a follow-up on [7], the design of sparse matrices in Matlab*P, a kind of parallel version of Matlab, is discussed in [15]. Unfortunately, little information about the current status of this language is available. More recently, a “graphBLAS” [10] has been suggested, where one of the operations, BuildMatrix, corresponds to the sparse function.

As mentioned, finite element methods naturally lead to the assembly of large sparse matrices. A stack-based representation specially designed for this application is suggested in [8], and is also implemented there using a hybrid parallel programming model on a Cray XE6. Another approach is reported in [4], where the assembly of finite element sparse matrices in both Matlab and Octave is considered using these high-level languages directly.

Using the “time is proportional to flops” rule as a guiding principle, but paying close attention to memory accesses, we provide a fast re-implementation of the sparse function. The resulting function fsparse is Matlab compatible, memory efficient, and parallelizes well on modern multicore computers. Moreover, it is well tested and has been freely available in the public domain for quite some time.
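Since fsparse is Matlab compatible, it accepts the same triplet syntax as the built-in sparse. The sketch below is ours and only illustrative: the problem size is arbitrary, and it assumes the authors' freely available fsparse is on the Matlab path.

  nz = 1e7;                      % number of triplets (illustrative)
  i = randi(1e6,nz,1);
  j = randi(1e6,nz,1);
  s = rand(nz,1);

  S1 = sparse(i,j,s);            % built-in, serial assembly
  S2 = fsparse(i,j,s);           % drop-in replacement, threaded
  norm(S1-S2,1)                  % expected to vanish up to round-off
                                 % from differing summation order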

In Section 2 we describe in some detail the proposed algorithm, which can be understood as an efficient index-based sorting rule. Parallelization aspects are discussed in Section 3, and performance experiments are reported in Section 4, where the memory-bound character of the operation is also highlighted. In general, as with most sparse algorithms, there are not enough non-trivial arithmetic operations to hide the format overhead and data transfer costs [3]. A summarizing discussion of these issues is found in Section 5.

The code discussed in the paper is publicly available, and the performance experiments reported here can be repeated through the Matlab scripts we distribute. Refer to Section 5.1 for details.

Section snippets

A fast general algorithm for sparse assembly

In this section we lay out a fast algorithm for assembling sparse matrices from the standard index triplet data. A description of the problem is first offered in Section 2.1, where some alternative approaches and extensions of the problem are also briefly mentioned. The formats of input and output are detailed in Section 2.2 after which the algorithm is presented stepwise in Section 2.3. A concluding complexity analysis in Section 2.4 demonstrates that the algorithm proposed has a favorable…
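For orientation only, the following is our own high-level Matlab sketch of triplet-to-CCS assembly by explicit sorting; it is not the algorithm of Section 2, which relies on index-based sorting rather than an explicit comparison sort, but it shows the input and output involved. The helper name triplet2ccs is hypothetical.

  function [colptr,rowidx,vals] = triplet2ccs(i,j,s,n)
  % Naive triplet-to-CCS assembly by sorting; duplicate (i,j) pairs are summed.
  i = i(:); j = j(:); s = s(:);
  [~,p] = sortrows([j i]);                 % order by column, then by row
  i = i(p); j = j(p); s = s(p);
  keep = [true; diff(i)~=0 | diff(j)~=0];  % first triplet of each (i,j) group
  vals = accumarray(cumsum(keep),s);       % sum duplicates within each group
  rowidx = i(keep);                        % row indices of the kept entries
  colptr = cumsum([1; accumarray(j(keep),1,[n 1])]);  % 1-based column pointers

The explicit sort costs O(nnz log nnz), and it is precisely this kind of overhead that an index-based formulation aims to avoid.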

Parallel sparse assembly in shared memory

In this section we present and analyze the threaded version of the assembly algorithm outlined in Section 2. The obvious approach to parallelizing the algorithm is to evenly distribute the input data among the cores and perform a local assembly, then finalize the result by summing these local matrices together. We advocate against this for two reasons: the required amount of working memory is substantially larger with this approach and the final gather operation is a quite complicated task in…
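For concreteness, the “obvious” approach just described could look as follows in Matlab (our sketch only; i, j, s are the triplet vectors, m and n the matrix dimensions, and the Parallel Computing Toolbox is assumed for parfor). It is this local-assembly-plus-gather pattern, with its extra working memory and final summation, that is advocated against above.

  P = 4;                                    % number of chunks/workers (illustrative)
  edges = round(linspace(0,numel(s),P+1));  % chunk boundaries into the triplet lists
  Sp = cell(1,P);
  parfor p = 1:P
    k = edges(p)+1:edges(p+1);
    Sp{p} = sparse(i(k),j(k),s(k),m,n);     % independent local assembly per chunk
  end
  S = Sp{1};
  for p = 2:P
    S = S + Sp{p};                          % final gather: sum the partial matrices
  end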

Performance experiments

In this section we present results of performance experiments of both the proposed serial and parallel algorithms. To get some baseline results for our index-based sorting algorithm we start by profiling our serial version and briefly compare it with the built-in Matlab function sparse. The results for our threaded version are presented in Section 4.3, where we look at the speedup for the different parts of the assembly process. Although we clearly observe the bandwidth bound character of the…
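A minimal timing comparison in this spirit could look as follows (our sketch only; the Matlab scripts distributed with the paper should be used to reproduce the reported numbers, and the problem size and repetition count below are arbitrary).

  n = 1e6; nz = 10*n;                      % illustrative problem size
  i = randi(n,nz,1); j = randi(n,nz,1); s = rand(nz,1);

  tic; for r = 1:5, S1 = sparse(i,j,s);  end; t1 = toc/5;
  tic; for r = 1:5, S2 = fsparse(i,j,s); end; t2 = toc/5;
  fprintf('sparse: %.2f s, fsparse: %.2f s, speedup: %.1fx\n',t1,t2,t1/t2);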

Conclusions

In this paper we devised an index-based implicit sorting algorithm for the assembly of sparse matrices in CCS format given raw index-triplet data. The algorithm was shown to be efficient in terms of memory accesses and does not require much auxiliary memory. We also showed how the algorithm could be modified and parallelized on multicore CPUs with shared memory. The characteristic in terms of memory accesses for our parallel version is reminiscent of the serial one and results in a good overall…

Acknowledgment

The authors would like to thank master’s students Aidin Dadashzadeh and Simon Ternsjö who programmed an early version of the parallel fsparse code [5]. Their work was performed within the project course in Computational Science at the Department of Information Technology, Uppsala University, given by Maya Neytcheva.

The research that led to this paper was supported by the Swedish Research Council and carried out within the Linnaeus centre of excellence UPMARC, Uppsala Programming for Multicore Architectures Research Center.

References (16)

  • D.E. Knuth

    The art of computer programming, volume 3: sorting and searching

    Addison-Wesley Series in Computer Science

    (1973)
  • K. Asanovic et al.

    The landscape of parallel computing research: a view from Berkeley

    (Dec 2006)
  • M. Aspnäs et al.

    Efficient assembly of sparse matrices using hashing

  • A. Buluc et al.

    Challenges and advances in parallel sparse matrix-matrix multiplication

    ICPP ’08. 37th International Conference on Parallel Processing

    (2008)
  • F. Cuvelier et al.

    An efficient way to perform the assembly of finite element matrices in Matlab and Octave

    (2013)
  • A. Dadashzadeh et al.

    Parallel assembly of sparse matrices using CCS and OpenMP

    (2013)
  • T.A. Davis et al.

    The University of Florida sparse matrix collection

    ACM Trans. Math. Softw.

    (2011)
  • J. Gilbert et al.

    Sparse matrices in MATLAB: design and implementation

    SIAM J. Matrix Anal. Appl.

    (1992)
