
Parallel Computing

Volume 56, August 2016, Pages 1-17

Fast Matlab compatible sparse assembly on multicore computers

https://doi.org/10.1016/j.parco.2016.04.001

Highlights

  • A novel algorithm for fast general assembly of sparse matrices is proposed.

  • Modifications for efficient parallelization on multicore platforms are detailed.

  • An open source implementation and benchmark tests demonstrate notable performance gains.

  • A Matlab interface and scripts are provided for full reproducibility.

Abstract

We develop and implement in this paper a fast sparse assembly algorithm, the fundamental operation that creates a compressed matrix from raw index data. Since it is often a quite demanding and sometimes critical operation, it is of interest to design a highly efficient implementation. We show how to do this, and moreover, we show how our implementation can be parallelized to utilize the power of modern multicore computers. Our freely available code, fully Matlab compatible, achieves a speedup of about 5× on a typical 6-core machine and about 10× on a dual-socket 16-core machine compared to the built-in serial implementation.

Introduction

The popular Matlab programming environment was originally built around the insight that most computing applications in some way or other rely on the storage and manipulation of one fundamental object: the matrix. In the early 90s an important update added support for a sparse storage format, as presented in [7]. That paper describes how sparse matrices are managed in an otherwise dense-matrix environment, including the initial creation of a sparse matrix, some basic manipulations and operations, and fundamental matrix factorizations in sparse format.

As a guiding principle the authors formulate the “time is proportional to flops” rule [7, p. 334]:

The time required for a sparse matrix operation should be proportional to the number of arithmetic operations on nonzero quantities.

The situation is somewhat different today, since flops can often be considered “free” while memory transfers are, in most cases, the real bottleneck of a program. With the multicore era here to stay, programs need to be threaded in order to utilize all hardware resources efficiently. This is a non-trivial task and requires careful design [1].

In this paper we consider a single sparse operation, namely the initial assembly operation as implemented by the Matlab function sparse:

  S = sparse(i,j,s);

After the call, S contains a sparse representation of the matrix defined by S(i(k),j(k)) = s(k) for k ranging over the indices of the vectors {i, j, s}, where repeated index pairs imply that the corresponding elements are to be summed together. Many applications naturally lead to substantial repetitions of indices, and the implied reduction must be detected and handled efficiently. For example, in the important case of assembly in linear finite element methods for partial differential equations, the resulting sparse matrix has a sparsity pattern which is identical to that of the matrix representation of the underlying triangular or tetrahedral mesh when viewed as a graph. The number of collisions during the assembly then corresponds exactly to the connectivity of the nodes in this graph.
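As a small concrete illustration (our own example, not taken from the paper), the following Matlab snippet shows the reduction of repeated index pairs: the two triplets sharing the index pair (2,1) are summed into a single entry.

  i = [1 2 2 3];
  j = [1 1 1 3];
  s = [10 20 5 7];
  S = sparse(i,j,s,3,3);
  full(S)
  % ans =
  %     10     0     0
  %     25     0     0
  %      0     0     7

The contributions 20 and 5 at position (2,1) have been summed to 25, which is exactly the kind of reduction a finite element assembly performs when several elements contribute to the same matrix entry.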

Since the assembly must be performed before any other matrix operations are executed, its performance may become a bottleneck. The typical example is dynamic nonlinear partial differential equations (PDEs), where re-assembly occurs many times as a numerical time integration proceeds, including during the iterations of the nonlinear solver. Thus, the assembly process is a quite time-consuming operation that is performed repeatedly, and its cost cannot always be amortized over subsequent operations. Notably, in the truly large case presented in [9, Sections 5.1.2–5.1.3], the performance of the sparse assembly is found to be the reason behind the loss of strong scaling beyond a few thousand cores.

Algorithms for sparse assembly have also caught the attention of others. General assembly via an intermediate hashed data format is considered in [2], where serial performance experiments in the PETSc library are also reported. As a follow-up on [7], the design of sparse matrices in Matlab*P, a kind of parallel version of Matlab, is discussed in [15]. Unfortunately, little information about the current status of this language is available. More recently, a “graphBLAS” [10] has been suggested, where one of the operations, BuildMatrix, corresponds to the sparse function.

As mentioned, finite element methods naturally lead to the assembly of large sparse matrices. A stack-based representation specially designed for this application is suggested in [8], and is also implemented there using a hybrid parallel programming model on a Cray XE6. Another approach is reported in [4], where the assembly of finite element sparse matrices in both Matlab and Octave is considered using these high-level languages directly.

Using the “time is proportional to flops” rule as a guiding principle, but paying close attention to memory accesses, we provide a fast re-implementation of the sparse function. The resulting function fsparse is Matlab compatible, memory efficient, and parallelizes well on modern multicore computers. Moreover, it is well tested and has been freely available in the public domain for quite some time.
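Since fsparse is Matlab compatible, it accepts the same triplet syntax as the built-in sparse. The sketch below is ours and only illustrative: the problem size is arbitrary, and it assumes the authors' freely available fsparse is on the Matlab path.

  nz = 1e7;                      % number of triplets (illustrative)
  i = randi(1e6,nz,1);
  j = randi(1e6,nz,1);
  s = rand(nz,1);

  S1 = sparse(i,j,s);            % built-in, serial assembly
  S2 = fsparse(i,j,s);           % drop-in replacement, threaded
  norm(S1-S2,1)                  % expected to vanish up to round-off
                                 % from differing summation order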

In Section 2 we describe in some detail the proposed algorithm, which can be understood as an efficient index-based sorting rule. Parallelization aspects are discussed in Section 3, and performance experiments are reported in Section 4, where the memory-bound character of the operation is also highlighted. In general, as with most sparse algorithms, there are not enough non-trivial arithmetic operations to hide the format overhead and data transfer costs [3]. A summarizing discussion of these issues is found in Section 5.

The code discussed in the paper is publicly available, and the performance experiments reported here can be repeated through the Matlab scripts we distribute. Refer to Section 5.1 for details.

Section snippets

A fast general algorithm for sparse assembly

In this section we lay out a fast algorithm for assembling sparse matrices from the standard index triplet data. A description of the problem is first offered in Section 2.1, where some alternative approaches and extensions of the problem are also briefly mentioned. The formats of input and output are detailed in Section 2.2 after which the algorithm is presented stepwise in Section 2.3. A concluding complexity analysis in Section 2.4 demonstrates that the algorithm proposed has a favorable…
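For orientation only, the following is our own high-level Matlab sketch of triplet-to-CCS assembly by explicit sorting; it is not the algorithm of Section 2, which relies on index-based sorting rather than an explicit comparison sort, but it shows the input and output involved. The helper name triplet2ccs is hypothetical.

  function [colptr,rowidx,vals] = triplet2ccs(i,j,s,n)
  % Naive triplet-to-CCS assembly by sorting; duplicate (i,j) pairs are summed.
  i = i(:); j = j(:); s = s(:);
  [~,p] = sortrows([j i]);                 % order by column, then by row
  i = i(p); j = j(p); s = s(p);
  keep = [true; diff(i)~=0 | diff(j)~=0];  % first triplet of each (i,j) group
  vals = accumarray(cumsum(keep),s);       % sum duplicates within each group
  rowidx = i(keep);                        % row indices of the kept entries
  colptr = cumsum([1; accumarray(j(keep),1,[n 1])]);  % 1-based column pointers

The explicit sort costs O(nnz log nnz), and it is precisely this kind of overhead that an index-based formulation aims to avoid.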

Parallel sparse assembly in shared memory

In this section we present and analyze the threaded version of the assembly algorithm outlined in Section 2. The obvious approach to parallelizing the algorithm is to evenly distribute the input data among the cores and perform a local assembly, then finalize the result by summing these local matrices together. We advocate against this for two reasons: the required amount of working memory is substantially larger with this approach and the final gather operation is a quite complicated task in…
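For concreteness, the “obvious” approach just described could look as follows in Matlab (our sketch only; i, j, s are the triplet vectors, m and n the matrix dimensions, and the Parallel Computing Toolbox is assumed for parfor). It is this local-assembly-plus-gather pattern, with its extra working memory and final summation, that is advocated against above.

  P = 4;                                    % number of chunks/workers (illustrative)
  edges = round(linspace(0,numel(s),P+1));  % chunk boundaries into the triplet lists
  Sp = cell(1,P);
  parfor p = 1:P
    k = edges(p)+1:edges(p+1);
    Sp{p} = sparse(i(k),j(k),s(k),m,n);     % independent local assembly per chunk
  end
  S = Sp{1};
  for p = 2:P
    S = S + Sp{p};                          % final gather: sum the partial matrices
  end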

Performance experiments

In this section we present results of performance experiments of both the proposed serial and parallel algorithms. To get some baseline results for our index-based sorting algorithm we start by profiling our serial version and briefly compare it with the built-in Matlab function sparse. The results for our threaded version are presented in Section 4.3, where we look at the speedup for the different parts of the assembly process. Although we clearly observe the bandwidth bound character of the…
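A minimal timing comparison in this spirit could look as follows (our sketch only; the Matlab scripts distributed with the paper should be used to reproduce the reported numbers, and the problem size and repetition count below are arbitrary).

  n = 1e6; nz = 10*n;                      % illustrative problem size
  i = randi(n,nz,1); j = randi(n,nz,1); s = rand(nz,1);

  tic; for r = 1:5, S1 = sparse(i,j,s);  end; t1 = toc/5;
  tic; for r = 1:5, S2 = fsparse(i,j,s); end; t2 = toc/5;
  fprintf('sparse: %.2f s, fsparse: %.2f s, speedup: %.1fx\n',t1,t2,t1/t2);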

Conclusions

In this paper we devised an index-based implicit sorting algorithm for the assembly of sparse matrices in CCS format given raw index-triplet data. The algorithm was shown to be efficient in terms of memory accesses and does not require much auxiliary memory. We also showed how the algorithm could be modified and parallelized on multicore CPUs with shared memory. The characteristic in terms of memory accesses for our parallel version is reminiscent of the serial one and results in a good overall…

Acknowledgment

The authors would like to thank master’s students Aidin Dadashzadeh and Simon Ternsjö who programmed an early version of the parallel fsparse code [5]. Their work was performed within the project course in Computational Science at the Department of Information Technology, Uppsala University, given by Maya Neytcheva.

The research that led to this paper was supported by the Swedish Research Council and carried out within the Linnaeus centre of excellence UPMARC, Uppsala Programming for Multicore Architectures Research Center.

References (16)

  • D.E. Knuth

    The art of computer programming, volume 3: sorting and searching

    Addison-Wesley Series in Computer Science

    (1973)
  • K. Asanovic et al.

    The landscape of parallel computing research: a view from Berkeley

    (Dec 2006)
  • M. Aspnäs et al.

    Efficient assembly of sparse matrices using hashing

  • A. Buluc et al.

    Challenges and advances in parallel sparse matrix-matrix multiplication

    ICPP ’08. 37th International Conference on Parallel Processing

    (2008)
  • F. Cuvelier et al.

    An efficient way to perform the assembly of finite element matrices in Matlab and Octave

    (2013)
  • A. Dadashzadeh et al.

    Parallel assembly of sparse matrices using CCS and OpenMP

    (2013)
  • T.A. Davis et al.

    The University of Florida sparse matrix collection

    ACM Trans. Math. Softw.

    (2011)
  • J. Gilbert et al.

    Sparse matrices in MATLAB: design and implementation

    SIAM J. Matrix Anal. Appl.

    (1992)
