1 Introduction and Motivation
This behavior is inherent to the pattern of access to shared data among the threads of parallel programs. Consider for instance the parallel FFT: threads first process row-wise transforms, followed by column-wise transforms, spreading the input data among cores; moreover, the same FFT coefficients are used by all nodes, so that if some of these coefficients are evicted from the L1 of one core, it is very likely that the L1 of a neighbor core will contain a copy. Similar cases are found in many applications.
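To illustrate this access pattern, here is a minimal sequential sketch of a 2D transform organized in the two phases described above; the shared twiddle-factor table stands in for the FFT coefficients reused by all nodes. Function names and structure are our illustration, not code from the paper, and a direct DFT replaces a real radix-2 FFT for brevity.

```python
import cmath

def twiddle_table(n):
    # Shared coefficient table W_n^k = exp(-2*pi*i*k/n); every row and column
    # transform, on every node, reuses these same values.
    return [cmath.exp(-2j * cmath.pi * k / n) for k in range(n)]

def dft(vec, w):
    # Direct O(n^2) DFT, used for brevity in place of a radix-2 FFT.
    n = len(vec)
    return [sum(vec[j] * w[(j * k) % n] for j in range(n)) for k in range(n)]

def dft2d(matrix):
    n = len(matrix)
    w = twiddle_table(n)  # one coefficient table shared by both phases
    # Phase 1: row-wise transforms (each thread would own a block of rows).
    rows = [dft(row, w) for row in matrix]
    # Phase 2: column-wise transforms; the same threads now read data produced
    # by other threads, and keep reusing the shared twiddle table.
    cols = [dft([rows[i][j] for i in range(n)], w) for j in range(n)]
    return [[cols[j][i] for j in range(n)] for i in range(n)]
```

Because both phases index the same coefficient table while the data ownership shifts from rows to columns, copies of the coefficients naturally end up replicated across the L1s of neighboring cores.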
- We propose an approach to reduce the long-latency drawback of shared L2 caches, and thus benefit from their lower miss rate compared to private L2s.
- For that purpose, we introduce the CCM, a mechanism that exploits the property that many L1 misses can be serviced by neighbor nodes, thus reducing the L1 miss latency.
- Thanks to a coherence protocol modification, the CCM can also remove all long-distance accesses when the L1 miss can be serviced by a neighbor node, reducing network hotspots and further contributing to reducing the overall L1 miss latency.
- We compare CCM against three state-of-the-art private or shared L2 mechanisms (ASR, DCC, RNUCA), and we show that, on average, CCM outperforms all three.
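A minimal sketch of the neighbor-service idea behind CCM, assuming a small cluster of nodes with private L1s and a distant home L2 bank; the class names, latency constants, and lookup order are our illustrative assumptions, not the actual CCM hardware:

```python
# Toy model of servicing an L1 miss from a neighbor node in the same cluster
# before paying the long-distance trip to the home L2 bank.
NEIGHBOR_LATENCY = 10   # assumed cycles to reach a neighbor L1 in the cluster
HOME_L2_LATENCY = 60    # assumed cycles for a long-distance home-node access

class Node:
    def __init__(self):
        self.l1 = {}  # address -> data

class Cluster:
    def __init__(self, nodes):
        self.nodes = nodes

    def load(self, requester, addr, l2):
        # Hit in the local L1: no miss at all.
        if addr in requester.l1:
            return requester.l1[addr], 0
        # Cluster step: a neighbor L1 inside the cluster may hold a copy.
        for node in self.nodes:
            if node is not requester and addr in node.l1:
                requester.l1[addr] = node.l1[addr]
                return node.l1[addr], NEIGHBOR_LATENCY
        # Fall back to the distant home L2 bank.
        requester.l1[addr] = l2[addr]
        return l2[addr], HOME_L2_LATENCY
```

In this toy model, any address cached by a cluster neighbor is returned at the (assumed) short intra-cluster latency instead of the long home-node latency, which is the saving CCM targets.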
2 Related Work
2.1 Replication
2.2 Migration
2.3 Leveraging Data Proximity
2.4 Hierarchical Coherence
3 Cluster Cache Architecture
3.1 Overview of the CCM Operations
3.2 Hardware Structure of CCM
3.2.1 CTA
3.2.2 MRUTB and CRB
3.2.3 Modifications of Network Interface
3.3 CCM Coherence Protocol
4 Experimental Methodology
Component | Parameter
---|---
Cache Technology | 45 nm CMOS NVT
NoC Technology | 45 nm CMOS NVT
Memory Technology | 68 nm
CMP Size | 64 cores
Processor Model | Sparc V9, in-order
Frequency | 2 GHz
Cache Line Size | 64 B
L1 I-Cache Size/Associativity | 32 KB/2-way
L1 D-Cache Size/Associativity | 32 KB/2-way
L1 Load-to-Use Latency | 2 cycles
L1 Replacement Policy | LRU
L2 Cache Size (total)/Associativity | 16 MB/8-way
L2 Load-to-Use Latency | 15 cycles
L2 Replacement Policy | Pseudo-LRU
Network Topology | \(8\times 8\) 2D mesh
Garnet Configuration | Fixed
V-Net No./V-Channel No. | 2/2
Buffers per V-Channel | 4
Crossbar Model | MULTREE
Virtual Channel Arbiter Model | MATRIX
Switch Arbiter Model | MATRIX
Link Length | 2.5 mm
Flit Width | 32 bits
Hop Latency | 5 cycles
External Memory Size | 8 GB
External Memory Latency | 300 cycles
To distinguish the SPLASH-2 raytrace from the PARSEC raytrace, we call the PARSEC version praytrace. The programs are run from beginning to end on Solaris 10 OS. The proposed CCM architecture was compared against the standard distributed shared L2 cache organization with a directory coherence protocol, Adaptive Selective Replication (ASR) [6], Distributed Cooperative Caching (DCC) [17] and Reactive NUCA (RNUCA) [16]. The configurations of the different mechanisms are the following.

SPLASH2 Workloads | Inputs | PARSEC Workloads | Inputs
---|---|---|---
barnes | 32,768 particles | blackscholes | simlarge
cholesky | tk 29.0 | bodytrack | simlarge
fft | -m 20 | canneal | simlarge
fmm | 32,768 particles | dedup | simsmall
lu | \(1{,}024\times 1{,}024\) | facesim | simsmall
ocean | \(514\times 514\) grid | ferret | simsmall
radiosity | room model | fluidanimate | simlarge
radix | 1M keys, 1024 radix, 2M maxkey | freqmine | simsmall
raytrace | teapot.env | (p)raytrace | simlarge
volrend | head | streamcluster | simsmall
 | | swaptions | simlarge
 | | vips | simlarge
 | | x264 | simlarge
4.1 Distributed Shared L2, Directory Coherence (Baseline)
4.2 Cluster Cache Monitor (CCM)
4.3 Adaptive Selective Replication (ASR)
4.4 Distributed Cooperative Caching (DCC)
4.5 Reactive NUCA (RNUCA)
5 Performance Evaluation
5.1 Energy
5.2 Performance
For radiosity, radix or x264, DCC or ASR outperform CCM. For x264, more than 60 % of CCM accesses correspond to the merge case of Fig. 8e, which reduces the memory traffic, but not the latency. For radix and radiosity, the input set is small and fits in the local private L2s, which then have a definite advantage over a shared L2. For instance, in DCC, respectively 82.5 % and 94.6 % of the L1 misses of these two benchmarks are serviced by the local private L2.

5.3 L1 Miss Latency Analysis
volrend in Fig. 12. Still, the overall execution time of DCC will be worse than that of CCM because, in parallel workloads, the execution time is determined by the slowest node.
The baseline L1 miss latency of fft is 178 cycles, while the baseline latency of radiosity is 679 cycles. These drastic latency variations are due to the presence of multiple simultaneous requests at some nodes, which is itself due to either simultaneous misses or the unbalanced allocation of home nodes across cores by the operating system. By capturing many of the L1 misses within the cluster, CCM can considerably reduce this congestion, e.g., latency is down to 147 cycles for radiosity, and 162 cycles for fft. The overhead brought by CCM remains small across all benchmarks, as shown in Fig. 14.

5.4 Network Traffic
Figure 16 shows the per-node network traffic of radiosity for the baseline architecture, CCM, ASR, DCC and RNUCA. Dark (blue) squares correspond to high flit activity, while light (green) or white squares correspond to little or no activity. The baseline system exhibits several hotspots, with one in particular at the node located on the 5th row and the 1st column. On the CCM graph, the hotspots have disappeared and the traffic at many nodes has become fairly low. While ASR and DCC also successfully remove hotspots, the overall traffic remains high.

5.5 Hardware Cost
Design | L1 (KB) | L2 (KB) | Special (KB) | Total (KB) | Area (%)
---|---|---|---|---|---
BASE | 6.5 | 54.5 | 0 | 61 | 0
CCM | 6.625 | 30.5 | 6.79 | 43.91 | \(-\)28
ASR | 6.625 | 58.5 | 18 | 83.125 | 36.3
DCC | 6.5 | 25.5 | 54.5 | 86.5 | 41.8
RNUCA | 6.5 | 57.5 | 0.015625 | 64.02 | 4.9
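As a quick sanity check on the Area (%) column, a hypothetical helper (the function name and dictionary are ours, not the paper's) recomputes each design's storage overhead relative to the 61 KB baseline total:

```python
# Recompute the "Area (%)" column of the hardware-cost table: the storage
# overhead of each design's total (KB) relative to the 61 KB baseline.
BASE_TOTAL_KB = 61.0

def area_overhead_pct(total_kb, base_kb=BASE_TOTAL_KB):
    """Percentage storage overhead relative to the baseline total."""
    return (total_kb - base_kb) / base_kb * 100.0

totals = {"CCM": 43.91, "ASR": 83.125, "DCC": 86.5, "RNUCA": 64.02}
overheads = {design: area_overhead_pct(t) for design, t in totals.items()}
# Matches the table to rounding: CCM comes out at about -28 %, ASR at
# about +36.3 %, DCC at about +41.8 %, RNUCA at about +4.9 %.
```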
5.6 Cluster Size
5.7 CTA Conflicts
We compare the baseline CTA design (1bank_0buf) and the optimized CCM design with different configurations, including 4-partition without MRUTB, 4-partition with an 8-entry MRUTB, and 4-partition with a 16-entry MRUTB. We find that the proposed optimization reduces conflicts by 38 % on average. The 16-entry MRUTB resolves 34 % of conflicts on average, while increasing the CCM hardware cost by merely 130 bytes.