25-04-2018 | S.I.: Emerging Intelligent Algorithms for Edge-of-Things Computing | Issue 5/2019 | Open Access
Using hardware counter-based performance model to diagnose scaling issues of HPC applications
Journal: Neural Computing and Applications, Issue 5/2019
1 Introduction
2 The performance diagnostic framework
| Resource-based metrics | Description | How are they derived? |
| --- | --- | --- |
| \(T\_stall_i\) | Waiting time for memory | Fitting from RESOURCE_STALLS.LB(SB) counter |
| \(T\_L1_i\) | L1 cache access time | Fitting from MEM_LOAD_UOPS_RETIRED.L1_HIT_PS counter |
| \(T\_L2_i\) | L2 cache access time | Fitting from MEM_LOAD_UOPS_RETIRED.L2_HIT_PS counter |
| \(T\_LLC_i\) | Last-level cache access time | Fitting from MEM_LOAD_UOPS_RETIRED.LLC_HIT_PS counter |
| \(T\_mainmemory_i\) | Main memory time | Fitting from MEM_UOPS_RETIRED.ALL_LOADS(STORES)_PS counter |
| Instructions | Executed instructions | Fitting from INST_RETIRED.ANY_P counter |
| \(T\_collective\) | Collective MPI communication | Fitting from P and S |
| \(T\_p2p\) | Point-to-point MPI communication | Fitting from S |
| \(T\_others\) | Initialization and finalization time | Fitting from P |
| Definitions | Description | How are they derived? |
| --- | --- | --- |
| S | Total communication volume | Measured |
| P | Number of processes | Input |
| D | Problem size | Input |
| n | Total number of kernels | Detected |
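Several metrics above are described as "fitting from" a counter or from P and S, but the listing does not state the functional form of the fit. As a minimal sketch only, the following assumes a power law \(a \cdot P^{b}\) fitted by least squares in log-log space; the power-law form, the function names `fit_power_law` and `predict`, and the sample values are all hypothetical and are not taken from the article.

```python
import math


def fit_power_law(procs, values):
    """Least-squares fit of values ~ a * procs**b, done in log-log space.

    ASSUMPTION: the power-law model is illustrative; the article's actual
    fitting function is not given in this listing.
    """
    xs = [math.log(p) for p in procs]
    ys = [math.log(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Ordinary least squares for the line ys = log(a) + b * xs
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict(a, b, p):
    """Extrapolate the fitted model to a process count p."""
    return a * p ** b


# Made-up per-process counter readings (e.g. retired instructions) that
# halve each time the process count doubles; not measured data.
procs = [2, 4, 8, 16]
values = [8.0e9, 4.0e9, 2.0e9, 1.0e9]
a, b = fit_power_law(procs, values)
predicted_32 = predict(a, b, 32)
```

With these synthetic readings the fitted exponent comes out close to -1, i.e. ideal strong scaling of the counter; an exponent noticeably above -1 on real data would suggest work that does not shrink with P.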
| Time-based metrics | Description | How are they derived? |
| --- | --- | --- |
| \(T\_comp_i\) | Calculation time of each kernel | \(T\_comp_i=\frac{\text{instructions}_i \cdot CPI_i}{\text{CPU frequency} \cdot P}\) |
| \(T\_mem_i\) | Total memory time | \(T\_mem_i=T\_L1_i+T\_L2_i+T\_LLC_i+T\_mainmemory_i\) |
| \(BF\_mem_i\) | Ratio of non-overlapped memory time | \(BF\_mem_i=\frac{T\_stall_i}{T\_mem_i}\) |
| \(BF\_comm\) | Ratio of non-overlapped communication time | \(BF\_comm=\frac{T\_{mapp}-T\_{mcomp}}{T\_{mcomm}}\) |
| \(T\_comm\) | Average communication time | \(T\_comm=\sum _{i=1}^{r} T\_p2p+\sum _{i=1}^{l}T\_collective\) |
| \(CPI_i\) | Cycles per instruction | Measured |
| \(T\_{mapp}\) | Total application time | Measured |
| \(T\_{mcomp}\) | Total computation time | Measured |
| \(T\_{mcomm}\) | Total communication time | Measured |
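The time-based metrics can be evaluated directly from the fitted and measured quantities. The sketch below is one possible transcription of the formulas into Python, not the authors' implementation. It assumes the numerator of \(BF\_comm\) is the difference between total application time and total computation time, and the class name `KernelSample`, the function names, and all sample numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KernelSample:
    """Per-kernel inputs (hypothetical container, times in seconds)."""
    instructions: float   # executed instructions of kernel i
    cpi: float            # measured cycles per instruction
    t_stall: float        # waiting time for memory
    t_l1: float           # L1 cache access time
    t_l2: float           # L2 cache access time
    t_llc: float          # last-level cache access time
    t_mainmemory: float   # main memory time


def t_comp(k, cpu_frequency_hz, num_procs):
    """T_comp_i = instructions_i * CPI_i / (CPU frequency * P)."""
    return k.instructions * k.cpi / (cpu_frequency_hz * num_procs)


def t_mem(k):
    """T_mem_i = T_L1_i + T_L2_i + T_LLC_i + T_mainmemory_i."""
    return k.t_l1 + k.t_l2 + k.t_llc + k.t_mainmemory


def bf_mem(k):
    """BF_mem_i = T_stall_i / T_mem_i (non-overlapped share of memory time)."""
    return k.t_stall / t_mem(k)


def bf_comm(t_mapp, t_mcomp, t_mcomm):
    """BF_comm = (T_mapp - T_mcomp) / T_mcomm.

    ASSUMPTION: the numerator is read as a difference of measured times.
    """
    return (t_mapp - t_mcomp) / t_mcomm


def t_comm(p2p_times, collective_times):
    """T_comm = sum of point-to-point times + sum of collective times."""
    return sum(p2p_times) + sum(collective_times)


# Made-up numbers chosen for easy arithmetic; not measurements.
kernel = KernelSample(instructions=4.0e10, cpi=0.5, t_stall=1.0,
                      t_l1=1.0, t_l2=0.5, t_llc=0.5, t_mainmemory=2.0)
comp = t_comp(kernel, cpu_frequency_hz=2.0e9, num_procs=4)
mem = t_mem(kernel)
mem_ratio = bf_mem(kernel)
comm_ratio = bf_comm(t_mapp=10.0, t_mcomp=8.0, t_mcomm=4.0)
comm = t_comm([0.5, 0.25], [1.0])
```

A \(BF\_mem_i\) near 1 would mean almost none of the memory time is hidden behind computation, while a value near 0 would mean memory accesses are largely overlapped.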
3 Evaluation
3.1 HOMME
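| Case name | ne32 | ne256 |
| --- | --- | --- |
| Horizontal grids | \(32 \cdot 32 \cdot 6\) | \(256 \cdot 256 \cdot 6\) |
| Numerical method | Spectral element | Spectral element |
| Simulation method | J & W baroclinic instability | J & W baroclinic instability |
| Vertical layer | 128 | 128 |
| Simulation time | 2 h simulation | 2 h simulation |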
3.1.1 Strong scaling diagnostics
3.1.2 Weak scaling diagnostics
3.1.3 Kernel ranking diagnostics
3.2 CICE
| Case name | gx3 | gx1 |
| --- | --- | --- |
| Horizontal grids | \(116 \cdot 100\) | \(384 \cdot 320\) |
| Grid decomposition method | slenderX2 | slenderX2 |
| Simulation time | 1 month | 1 month |
3.2.1 Strong scaling diagnostics
| Number of cores | Theoretical | Measured |
| --- | --- | --- |
| 2 | 0.29 | 0.01 |
| 8 | 1.18 | 0.05 |
| 20 | 2.94 | 0.07 |
3.2.2 Weak scaling diagnostics
3.3 OpenFOAM
| Case name | motorBike |
| --- | --- |
| Cell number | \(20 \cdot 8 \cdot 8\) |
| Decomposition method | ptscotch |
| Simulation time | 200 steps |