1 Introduction
2 Background: Hybrid BFS
2.1 The Base Hybrid BFS Algorithm
2.2 Parallel and Distributed BFS Algorithm
-
Each processor is given the segment of the frontier corresponding to their assigned submatrix.
-
Search for parents with the information available locally.
-
Send updates of children that found parents and process updates for own segment of parents.
-
Send completed to the right neighbor and receive completed for the next substep from the left neighbor.
3 Problems of Hybrid BFS in Extreme-Scale Supercomputers
3.1 Problems with the Data Structure of the Adjacency Matrix
3.1.1 Compressed Sparse Row (CSR)
3.1.2 DCSR
3.1.3 Coarse Index + Skip List
3.1.4 Other Sparse Matrix Formats
3.2 Problems with Communication Overhead
Operation | Comm type | Comm complexity per step | Data transfer per each search (64 bit word) |
---|---|---|---|
Transpose | P2P |
O(1) |
\(s_b V / 64\)
|
Frontier Gather | Allgather |
O(1) |
\(s_b VR / 64\)
|
Parent Updates | P2P |
O(C) | 2V
|
Rotate Completed | P2P |
O(C) |
\(s_b VC / 64\)
|
4 Our Extremely Scalable Hybrid BFS
4.1 Bitmap-Based Sparse Matrix Representation
Edges list | SRC | 0 0 6 7 |
DST | 4 5 3 1 | |
CSR | Row-starts | 0 2 2 2 2 2 2 3 4 |
DST | 4 5 3 1 | |
Bitmap-based sparse matrix representation | Offset | 0 1 3 |
Bitmap | 1 0 0 0 0 0 1 1 | |
Row-starts | 0 2 3 4 | |
DST | 4 5 3 1 | |
DCSR | AUX | 0 1 1 3 |
JC | 0 6 7 | |
Row-starts | 0 2 3 4 | |
DST | 4 5 3 1 |
Data structure | CSR | Bitmap-based CSR | ||
---|---|---|---|---|
Order | Actual | Order | Actual | |
Offset | – | – |
\(V'C/64\)
| 32 MB |
Bitmap | – | – |
\(V'C/64\)
| 32 MB |
Row-starts |
\(V'C\)
| 2048 MB |
\(V'p\)
| 190 MB |
DST |
\(V' \hat{d}\)
| 1020 MB |
\(V' \hat{d}\)
| 1020 MB |
Total |
\(V'(C+ \hat{d})\)
| 3068 MB |
\(V'(C/32+p+\hat{d})\)
| 1274 MB |
Data structure | DCSR | Coarse index + Skip list | ||
---|---|---|---|---|
Order | Actual | Order | Actual | |
AUX |
\(V'p\)
| 190 MB | – | – |
JC |
\(V'p\)
| 190 MB | – | – |
Row-starts |
\(V'p\)
| 190 MB |
\(V'C/64\)
| 32 MB |
DST or skip list |
\(V' \hat{d}\)
| 1020 MB |
\(V' \hat{d} + V'p\)
| 1210 MB |
Total |
\(V'(3p+ \hat{d})\)
| 1590 MB |
\(V'(C/64+p+\hat{d})\)
| 1242 MB |
4.2 Reordering of the Vertex IDs
Offset | 0 1 3 |
Bitmap | 1 0 0 0 0 0 1 1 |
SRC(Orig) | 2 0 1 |
Row-starts | 0 2 3 4 |
DST | 2 3 0 1 |
DST(Orig) | 4 5 3 1 |
4.3 Optimizing Inter-Node Communications for Bottom-Up BFS
4.3.1 Optimizing Parent Updates Communication
4.3.2 Overlapping Computation and Communication in Rotate Completed Operation
4.4 Reducing Communication with Better Partitioning
Operation | Comm type | Comm complexity per step | Data transfer per each search (64 bit word) |
---|---|---|---|
Frontier Gather | Allgather |
O(1) |
\(s_b VR / 64\)
|
Parent Updates |
Alltoall
|
\(\textit{O(1)}\)
| 2V
|
Rotate Completed | P2P |
O(C) |
\(s_b VC / 64\)
|