1 Introduction
2 Gauss–Jordan elimination
2.1 Basic scheme
2.2 Block reformulation
2.3 Distributed algorithm
2.4 Memory access analysis
3 GPU implementation
3.1 Look-ahead and asynchronous operation
3.2 Data layout
| \(m\) | \(n\) | Swap: column major (%) | Swap + transpose: row major (%) |
|---|---|---|---|
| 5120 | 1 | 32.5 | 14.9 |
| 5120 | 5120 | 37.5 | 8.6 |
| 10,240 | 1 | 24.8 | 7.2 |
| 10,240 | 10,240 | 24.4 | 5.0 |
| 15,360 | 1 | 21.7 | 4.6 |
| 15,360 | 15,360 | 20.0 | 3.7 |
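The layout dependence of the swap cost shown above has a simple memory-level explanation. The following is a minimal CPU-side sketch, not the paper's code and with hypothetical helper names, of a pivot-row exchange in both layouts: in column-major storage the elements of a row are strided by the leading dimension, so the swap touches one cache line per column, whereas in row-major storage each row is contiguous and the swap reduces to three `memcpy` calls.

```c
#include <string.h>

/* Sketch (hypothetical helpers): swapping pivot rows r1 and r2 of an
 * m x n matrix in the two storage layouts. */

/* column-major: element (i,j) lives at A[i + j*ld] */
static void swap_rows_colmajor(double *A, int ld, int n, int r1, int r2)
{
    for (int j = 0; j < n; ++j) {          /* n stride-ld accesses */
        double t       = A[r1 + j * ld];
        A[r1 + j * ld] = A[r2 + j * ld];
        A[r2 + j * ld] = t;
    }
}

/* row-major: element (i,j) lives at A[i*ld + j]; work holds n doubles */
static void swap_rows_rowmajor(double *A, int ld, int n, int r1, int r2,
                               double *work)
{
    memcpy(work,        A + r1 * ld, n * sizeof(double));  /* contiguous */
    memcpy(A + r1 * ld, A + r2 * ld, n * sizeof(double));
    memcpy(A + r2 * ld, work,        n * sizeof(double));
}
```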
GEMM(\(\alpha \), A, B, \(\beta \), C) computes \(C:=\alpha AB + \beta C\), where \(A\in \mathbb {R}^{m\times k}\), \(B\in \mathbb {R}^{k\times n}\), and \(C\in \mathbb {R}^{m\times n}\) are stored in (Fortran) column-major order. Furthermore, \(A^T\) in column-major storage is identical to A in row-major storage. It follows that the column-major oriented routine GEMM(\(\alpha \), B, A, \(\beta \), C) then computes \(C^T := \alpha B^T A^T + \beta C^T\), i.e., \(C:=\alpha AB + \beta C\) with all three matrices stored in row-major order.
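As a concrete illustration of this operand-swap trick (a minimal sketch, not the authors' code), the wrapper below calls the column-major cuBLAS routine `cublasDgemm` on row-major device data; the function name `gemm_rowmajor` and its interface are assumptions for this example. No data is moved or transposed, only the roles of the operands and of \(m\) and \(n\) are exchanged.

```c
#include <cublas_v2.h>

/* Sketch: C := alpha*A*B + beta*C for row-major A (m x k), B (k x n),
 * C (m x n). Reinterpreted as column-major, the arrays hold A^T, B^T,
 * C^T, so calling GEMM with swapped operands yields
 * C^T := alpha * B^T * A^T + beta * C^T, which is the desired result. */
cublasStatus_t gemm_rowmajor(cublasHandle_t handle,
                             int m, int n, int k,
                             const double *alpha,
                             const double *A,   /* device, row-major, ld = k */
                             const double *B,   /* device, row-major, ld = n */
                             const double *beta,
                             double *C)         /* device, row-major, ld = n */
{
    /* column-major view: B^T is n x k (ld n), A^T is k x m (ld k),
     * C^T is n x m (ld n) */
    return cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                       n, m, k,
                       alpha, B, n, A, k,
                       beta,  C, n);
}
```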
4 Experimental results
| \(m=n\) | Energy optimal \(N_B\) | Energy | Time | Time optimal \(N_B\) | Time | Energy |
|---|---|---|---|---|---|---|
| 2048 | 256 | 23.23 | 0.06 | 256 | 0.06 | 23.23 |
| 4096 | 512 | 123.49 | 0.32 | 640 | 0.31 | 124.33 |
| 6144 | 768 | 378.11 | 0.91 | 1024 | 0.90 | 379.56 |
| 8192 | 1024 | 854.20 | 2.00 | 1280 | 1.98 | 854.56 |
| 10,240 | 1152 | 1608.78 | 3.73 | 1536 | 3.68 | 1614.10 |
| 12,288 | 896 | 2614.01 | 6.34 | 1536 | 6.29 | 2680.57 |
| 14,336 | 1408 | 3896.29 | 9.73 | 2048 | 9.53 | 3936.05 |
| 16,384 | 1024 | 5568.58 | 14.36 | 1408\(^\mathrm{a}\) | 14.17 | 5578.11 |
| \(m=n\) | 8192 | 10,240 | 12,288 | 14,336 | 16,384 |
|---|---|---|---|---|---|
| Energy opt. | 1708.4 | 6000.7 | 16,572.8 | 37,910.9 | 79,964.8 |
| Time opt. | 1692.0 | 5939.9 | 16,860.8 | 37,510.6 | 79,041.8 |
| \(m=n\) | MAGMA: Time | GJE 1 GPU: Time | Speed-up | GJE 2 GPU: Time | Speed-up |
|---|---|---|---|---|---|
| 1024 | 0.03 | 0.02 | 1.50 | 0.02 | 1.50 |
| 2048 | 0.11 | 0.06 | 1.83 | 0.06 | 1.83 |
| 3072 | 0.23 | 0.15 | 1.53 | 0.11 | 2.09 |
| 4096 | 0.47 | 0.31 | 1.52 | 0.19 | 2.47 |
| 5120 | 0.80 | 0.56 | 1.43 | 0.32 | 2.50 |
| 6144 | 1.17 | 0.90 | 1.30 | 0.51 | 2.29 |
| 7168 | 1.80 | 1.37 | 1.31 | 0.76 | 2.37 |
| 8192 | 2.35 | 1.98 | 1.19 | 1.09 | 2.15 |
| \(m=n\) | MAGMA: Energy | GJE 1 GPU: Energy | Savings (%) | GJE 2 GPU: Energy | Savings (%) |
|---|---|---|---|---|---|
| 1024 | 6.45 | 2.64 | 59.07 | 3.35 | 48.06 |
| 2048 | 37.75 | 23.23 | 38.46 | 20.72 | 45.11 |
| 3072 | 85.23 | 62.87 | 26.24 | 46.13 | 45.88 |
| 4096 | 171.60 | 124.33 | 27.55 | 85.20 | 50.35 |
| 5120 | 288.23 | 227.49 | 21.07 | 150.72 | 47.71 |
| 6144 | 415.02 | 379.56 | 8.55 | 253.48 | 38.92 |
| 7168 | 639.03 | 585.90 | 8.31 | 393.24 | 38.46 |
| 8192 | 845.31 | 854.56 | \(-\)1.09 | 586.76 | 30.59 |
| \(m=n\) | MAGMA: 1 GPU | MAGMA: 2 GPU/CPU | GJE: 1 GPU | GJE: 2 GPU |
|---|---|---|---|---|
| 2048 | 4.3 | 2.7 | 1.4 | 1.2 |
| 4096 | 80.0 | 88.5 | 38.8 | 15.8 |
| 6144 | 483.6 | 718.1 | 343.5 | 128.4 |
| 8192 | 1983.6 | 4184.7 | 1704.8 | 639.5 |
| 10,240 | 6990.3 | 13,678.1 | 6001.7 | 2262.0 |
| 12,288 | 18,123.5 | 40,545.6 | 16,578.6 | 6399.3 |
| 14,336 | 41,601.4 | 93,147.9 | 37,919.4 | 15,364.3 |
| 16,384 | 83,609.4 | 230,820.8 | 79,985.4 | 32,713.2 |
| 18,432 | – | 430,949.8 | – | 63,212.7 |
| 20,480 | – | 829,642.6 | – | 114,236.3 |
| 22,528 | – | 1,355,716.0 | – | 196,836.7 |
- Gauss–Jordan is an all-at-once approach: a single algorithm and a single sweep over the augmented matrix D yield the solution of the problem (an unblocked reference sketch follows this list). Compared to the LU decomposition, this reduces the memory transfers, since no additional triangular solves are required.
- The GEMM operation is the only operation necessary on the device. Unlike the forward/backward substitution, it involves no data dependencies that would force partly sequential computation schemes. Furthermore, GEMM is known to be the best-optimized operation on CPUs as well as on GPUs.
- In contrast to the LU decomposition, the GEMM operations inside the Gauss–Jordan elimination scheme always work on a constant number of rows, so choosing a proper block size lets every GEMM call utilize the whole device.
- By removing the data dependencies introduced by the forward/backward substitution scheme and relying solely on the GEMM operation, one easily obtains a scalable distributed scheme that employs more than one GPU.
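As referenced in the first item, the following minimal, unblocked CPU sketch illustrates the one-sweep idea on the augmented matrix \(D = [A \,|\, B]\); the function name `gje_solve` and its interface are illustrative assumptions, not the paper's GPU code.

```c
#include <math.h>

/* One-sweep Gauss-Jordan reference sketch: D = [A | B] has n rows and
 * n + nrhs columns, stored row-major. On return, the right block of D
 * holds A^{-1}B. Returns 0 on success, -1 if a zero pivot occurs. */
int gje_solve(double *D, int n, int nrhs)
{
    const int ld = n + nrhs;                 /* row length of [A | B] */
    for (int k = 0; k < n; ++k) {
        /* partial pivoting: largest entry in column k on/below the diagonal */
        int p = k;
        for (int i = k + 1; i < n; ++i)
            if (fabs(D[i * ld + k]) > fabs(D[p * ld + k]))
                p = i;
        if (D[p * ld + k] == 0.0)
            return -1;                       /* matrix is singular */
        if (p != k)                          /* contiguous swap in row-major */
            for (int j = 0; j < ld; ++j) {
                double t = D[k * ld + j];
                D[k * ld + j] = D[p * ld + j];
                D[p * ld + j] = t;
            }
        /* scale the pivot row */
        const double piv = D[k * ld + k];
        for (int j = 0; j < ld; ++j)
            D[k * ld + j] /= piv;
        /* eliminate column k in ALL other rows, above and below the pivot */
        for (int i = 0; i < n; ++i) {
            if (i == k)
                continue;
            const double f = D[i * ld + k];
            for (int j = 0; j < ld; ++j)
                D[i * ld + j] -= f * D[k * ld + j];
        }
    }
    return 0;   /* left block of D is now I, right block holds A^{-1}B */
}
```

The blocked variant processes \(N_B\) columns per step, so the rank-1 eliminations of the inner loop fuse into one GEMM that always updates all n rows; this is precisely the property exploited on the GPU and in the distributed scheme above.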