1 Introduction
-
Middle stage (software-controlled NoC reconfigurability) To let the software layer control the interconnection topology, we proposed a lightweight mechanism that extends the router microarchitecture and makes the actual NoC topology (the so-called virtual topology) directly controllable through dedicated core instructions (see Sect. 4.2) [18]. Although demonstrated on top of an NoC architecture based on multiple rings, the proposed mechanism remains general enough to be adapted to other interconnections, e.g., routers with traditional crossbar switching elements.
-
Bottom stage (NoC performance improvement via hybrid topology) To improve the scalability and performance of the NoC infrastructure, a hybrid architecture combining a 2D mesh and rings is proposed. The results show that it is more efficient than the traditional pure 2D mesh topology (see Sect. 4.3) [19], as it efficiently processes both local traffic (on the rings) and global traffic (on the 2D mesh). This configuration allows us to better exploit network traffic localisation, thus facilitating traffic management at a low energy cost.
2 Background and related work
2.1 Dataflow models
2.1.1 Dataflow models: principles
2.1.2 Codelet model
2.2 Network-on-Chip architectures
2.2.1 Primer to virtual mapping
2.2.2 Hybridizing NoCs
3 Overview of the proposed framework
4 Hardware-software co-design framework
4.1 Thread distribution policy
4.1.1 Added software support for mapping threads
-
CreateThread(*code,SS,frame) Creates a new thread context, i.e., the SS, the type of the thread, a unique identifier, and the space required to hold the thread’s data frame. It also allows scheduling the execution of threads belonging to the same thread pool (TP) within the same VN.
-
CreateAF(*code,SS,frame) Like the instruction above, but it creates a new asynchronous function (AF), i.e., a new thread spawned outside the VN.
-
ReadData(offset) Reads data from a thread’s frame at a specific offset within the frame.
-
WriteData(TID,frame,offset,data) Writes data to a thread’s frame at a specific offset within the frame (both within and outside the current VN).
-
DecreaseSS(TID,dep_cnt) Decreases the SS of a thread by the number of resolved dependencies.
-
DeleteThread() Removes the context of a thread that has completed its execution.
-
SetVN(N_pe) Sets the number of cores (PEs) in each VN and sends a broadcast message announcing this number.
-
ConfigRouter(*config,Rd,B) Configures routers by specifying the memory address where the configuration is stored. The destination router identifier is contained in the Rd variable, while flag B indicates whether the configuration is broadcast to all routers.
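The interplay between these instructions can be illustrated with a minimal Python sketch. It is a software model under stated assumptions, not the hardware implementation: the SS is treated as a counter of unresolved dependencies, a thread fires as soon as its SS reaches zero, and the names `Scheduler`, `ThreadContext` and all method names are hypothetical. VN boundaries, CreateAF, SetVN and ConfigRouter are not modelled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ThreadContext:
    code: Callable[[List[int]], None]  # thread body, receives its own frame
    ss: int                            # unresolved dependencies (assumed meaning of SS)
    frame: List[int]                   # data frame of the thread


class Scheduler:
    """Hypothetical software model of the thread instructions listed above."""

    def __init__(self) -> None:
        self.contexts: Dict[int, ThreadContext] = {}
        self.next_tid = 0

    def create_thread(self, code, ss: int, frame_size: int) -> int:
        # Models CreateThread: allocate a context and a unique identifier.
        tid = self.next_tid
        self.next_tid += 1
        self.contexts[tid] = ThreadContext(code, ss, [0] * frame_size)
        self._fire_if_ready(tid)
        return tid

    def write_data(self, tid: int, offset: int, data: int) -> None:
        # Models WriteData: store a value in the target thread's frame.
        self.contexts[tid].frame[offset] = data

    def decrease_ss(self, tid: int, dep_cnt: int) -> None:
        # Models DecreaseSS: dep_cnt dependencies have been resolved.
        self.contexts[tid].ss -= dep_cnt
        self._fire_if_ready(tid)

    def _fire_if_ready(self, tid: int) -> None:
        ctx = self.contexts[tid]
        if ctx.ss == 0:
            ctx.code(ctx.frame)      # the thread runs without an explicit "start"
            del self.contexts[tid]   # models DeleteThread on completion


# Usage: a consumer thread that waits for two inputs before running.
results: List[int] = []
sched = Scheduler()
tid = sched.create_thread(lambda frame: results.append(frame[0] + frame[1]),
                          ss=2, frame_size=2)
sched.write_data(tid, 0, 3)
sched.decrease_ss(tid, 1)   # one dependency still pending, nothing runs yet
sched.write_data(tid, 1, 4)
sched.decrease_ss(tid, 1)   # SS reaches zero, the thread fires
```

In this model the producer writes a value and then signals it, so the consumer never observes a partially filled frame.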
DecreaseSS signalling operation to update the corresponding SS field to optimise the execution.
#pragma omp for when using OpenMP) with the correct sequence of CreateAF and CreateThread instructions. Figure 3 also shows that AF scheduling requests and writing operations remain well confined to the local VN. By monitoring scheduling slots, the hardware unit automatically fires threads that become runnable, without any explicit instruction. In contrast, the completion of execution is signalled by the DeleteThread instruction, which frees the resources held by the thread.
4.1.2 Hardware support for thread distribution
CreateThread or the CreateAF instruction), it uniquely identifies the PE responsible for the execution of the newly generated thread (i.e., \(\langle N_{id}, C_{id} \rangle _{dst} = H(N_{pe},I_{ex})\)). Once selected, the PE is signalled by sending a message over the network. Since the destination is encoded in the \(T_{id}\), any subsequent operation on the thread can easily be forwarded to its corresponding PE without any calculation. This contributes to the overall speedup of the system.
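To make the role of \(H(\cdot)\) and of the destination-carrying \(T_{id}\) concrete, the following Python sketch may help. It is only an illustration under assumptions: the multiplicative hash, the extra `total_pes` parameter, the reading of \(N_{id}\) as a VN index and \(C_{id}\) as a core index within the VN, and the 8-bit field widths are all hypothetical choices, not the hashing function used in this work.

```python
from collections import Counter
from typing import Tuple

KNUTH_32 = 2654435761  # odd multiplicative constant (assumption, not from the paper)


def h(n_pe: int, i_ex: int, total_pes: int) -> Tuple[int, int]:
    """Hypothetical H(N_pe, I_ex): select a destination <N_id, C_id>."""
    pe = ((i_ex * KNUTH_32) & 0xFFFFFFFF) % total_pes  # global PE index
    return divmod(pe, n_pe)  # (VN index, core index inside the VN)


def encode_tid(n_id: int, c_id: int, seq: int) -> int:
    """Pack the destination into the thread identifier (8-bit fields assumed)."""
    return (seq << 16) | (n_id << 8) | c_id


def decode_dst(tid: int) -> Tuple[int, int]:
    """Recover <N_id, C_id> from T_id using shifts and masks only."""
    return (tid >> 8) & 0xFF, tid & 0xFF


# 1024 consecutive I_ex values spread over 64 PEs grouped in VNs of 4 cores.
load = Counter(h(4, i_ex, 64) for i_ex in range(1024))

# Any later operation on the thread can be routed from its T_id alone.
n_id, c_id = h(4, 7, 64)
tid = encode_tid(n_id, c_id, seq=1)
assert decode_dst(tid) == (n_id, c_id)
```

With these particular parameters every PE receives the same number of requests, which is the balanced behaviour the scheme aims for; other hash functions or request patterns would generally give a less even spread.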
CreateAF and CreateThread requests among the available resources fairly. The effectiveness of the hashing function derives from the ability to avoid collisions, i.e., to limit the number of times two distinct input values result in the same output value for the hashing. In our distributed scheduling scheme, this translates into avoiding different PEs selecting the same destination, given two different \(T_{id}\). In that case, the PEs’ load (i.e., the number of threads to execute) is balanced, thus avoiding the formation of hot spots and increasing the overall system reliability.
4.2 Software-controlled NoC reconfigurability
4.2.1 Software interface
-
SetRouterCfg(Rd,Rs,B) Sends a configuration request to the routers by specifying the memory address where the configuration is stored. Variable Rd specifies the destination router, variable Rs contains the memory address where the configuration is stored, and flag B (unsigned immediate value) indicates whether the request is broadcast to all routers (\(B > 0\)) or not (\(B = 0\));
-
ReadCounter(Rd,Rs1,Rs2) Reads the content of a link’s counter by specifying the link to read (one of the four bits starting from the LSB position in variable Rs1 must be set), the destination router (variable Rs2), and the variable where the counter content will be stored (Rd);
-
ResetCounter(Rd,Rs,B) Resets traffic statistics by specifying which links’ counters to reset (each of the four bits starting from the LSB position in variable Rs is set if the corresponding link’s counter must be reset), the destination router (variable Rd), and flag B, which indicates whether the request is broadcast to all routers (\(B > 0\)) or not (\(B = 0\)).
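The operand conventions of ReadCounter and ResetCounter can be sketched in Python as follows. This is a behavioural illustration under assumptions: routers are modelled as a dictionary of four per-link counters, link indices 0 to 3 simply follow the bit positions (the mapping of bits to physical links is not specified here), and the function names are hypothetical.

```python
from typing import Dict, List

N_LINKS = 4  # the four least-significant bits select the router links


def selected_links(mask: int) -> List[int]:
    """Return the link indices whose bit is set in the 4-bit selector."""
    return [i for i in range(N_LINKS) if (mask >> i) & 1]


def read_counter(routers: Dict[int, List[int]], rs1: int, rs2: int) -> int:
    """Models ReadCounter: rs1 selects exactly one link, rs2 the router."""
    links = selected_links(rs1)
    if len(links) != 1:
        raise ValueError("exactly one of the four LSBs of Rs1 must be set")
    return routers[rs2][links[0]]


def reset_counter(routers: Dict[int, List[int]], rd: int, rs: int, b: int) -> None:
    """Models ResetCounter: rs is a link mask, b > 0 means broadcast."""
    targets = list(routers) if b > 0 else [rd]
    for rid in targets:
        for link in selected_links(rs):
            routers[rid][link] = 0


# Usage: two routers with arbitrary example counter values.
routers = {0: [5, 6, 7, 8], 1: [1, 2, 3, 4]}
value = read_counter(routers, rs1=0b0100, rs2=0)   # link 2 of router 0
reset_counter(routers, rd=1, rs=0b0011, b=0)       # links 0 and 1 of router 1 only
```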
SetRouterCfg instruction is executed, it generates a corresponding message sent to a specific router. The destination router directly accesses the memory location where the configuration is stored (actually, the PE performs this operation) by adding a fixed offset to the base address (Rs). The operation can be parallelized if multiple routers need to be configured.
4.2.2 Hardware support for virtual mapping
4.3 NoC performance improvement via hybrid topology
4.3.1 Modified 2D mesh router
Features | Parameters |
---|---|
No. of input and output ports | 8 each (4 ringlets, 4 mesh) |
Width of each port | 42 bits (32-bit payload, 10-bit header) |
No. of virtual channels | 2 per input port |
Packet switching | Store-and-forward (SAF) |
Switch allocator arbitration | Round-robin |
Packet routing | X–Y dimension order routing |
Router pipeline stages | 4 stages |
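For reference, X–Y dimension-order routing on the mesh can be sketched in a few lines of Python. The coordinate convention (routers addressed as `(x, y)` pairs) and the function names are assumptions made for illustration; only the mesh hops are modelled, not the delivery into the ringlets.

```python
from typing import List, Tuple

Coord = Tuple[int, int]


def xy_next_hop(cur: Coord, dst: Coord) -> Coord:
    """Move along X until the column matches, then along Y."""
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:
        return (cx, cy + (1 if dy > cy else -1))
    return cur  # already at the destination router


def xy_route(src: Coord, dst: Coord) -> List[Coord]:
    """List every router visited from src to dst, both included."""
    path = [src]
    while path[-1] != dst:
        path.append(xy_next_hop(path[-1], dst))
    return path


path = xy_route((0, 0), (2, 1))
```

Because the X offset is always resolved before the Y offset, the path between two routers is unique, which keeps the routing logic small.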
-
Routing/flow control module (RF) Extracts the packet header and processes the information to determine the destination router. If the packet destination is within one of the four ringlets belonging to the block, the RF module selects the corresponding output channel, reducing the latency of the VCA and SA modules. A control signal drives the input MUX at the input port.
-
VC allocator module (VCA) Is responsible for allocating buffer resources for incoming packets by selecting one of the VCs. An allocation request signal (i.e., req\(_{in}\)) is set, and if the selected VC has space to buffer the incoming packet, an acknowledgement signal (i.e., ack\(_{out}\)) is set too. In this case, the selected VC is also signalled to both the RF and SA modules.
-
Switch allocator module (SA) Performs the two arbitration steps. First, multiple VCs in each input port are arbitrated to select one of the available VCs. Then, each of the selected VCs is routed to the selected output port.
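The two arbitration steps of the SA module, combined with the round-robin policy listed in the router parameters, can be sketched as follows. This is a simplified behavioural model under assumptions: the class and function names are hypothetical, requests stay asserted across cycles (no packet is dequeued), and the arbiter pointer is updated on every local grant.

```python
from typing import Dict, List, Optional, Tuple


class RoundRobinArbiter:
    """Grants the first requester found after the previously granted one."""

    def __init__(self, n: int) -> None:
        self.n = n
        self.last = n - 1  # index 0 has the highest priority initially

    def grant(self, requests: List[bool]) -> Optional[int]:
        for k in range(1, self.n + 1):
            i = (self.last + k) % self.n
            if requests[i]:
                self.last = i
                return i
        return None


def switch_allocate(requests: List[List[Optional[int]]],
                    vc_arbs: List[RoundRobinArbiter],
                    out_arbs: List[RoundRobinArbiter]) -> Dict[int, Tuple[int, int]]:
    """requests[p][v] is the output port wanted by VC v of input port p, or None.

    Returns {output port: (input port, VC)} for the granted connections.
    """
    # Step 1: each input port selects one of its requesting VCs.
    winners: List[Optional[Tuple[int, int]]] = []
    for p, vcs in enumerate(requests):
        v = vc_arbs[p].grant([r is not None for r in vcs])
        winners.append(None if v is None else (v, vcs[v]))
    # Step 2: each output port selects one of the input ports competing for it.
    grants: Dict[int, Tuple[int, int]] = {}
    for out, arb in enumerate(out_arbs):
        p = arb.grant([w is not None and w[1] == out for w in winners])
        if p is not None:
            grants[out] = (p, winners[p][0])
    return grants


# Usage: 2 input ports with 2 VCs each, 2 output ports.
requests = [[1, None], [1, 0]]
vc_arbs = [RoundRobinArbiter(2) for _ in requests]
out_arbs = [RoundRobinArbiter(2) for _ in range(2)]
cycle1 = switch_allocate(requests, vc_arbs, out_arbs)
cycle2 = switch_allocate(requests, vc_arbs, out_arbs)
```

In the first cycle both input ports compete for output 1 and only one is served; in the second cycle the rotating priorities let the other input port proceed through its second VC.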
4.3.2 Ring switch
-
The MUX of each input port determines the destination based on the packet’s header information and the arbitration.
-
Packets from the ring ports (see Fig. 12, horizontal dimension) have a higher priority than packets from the processing core or the 2D mesh router. Thus, such packets are moved first from the input port to the output port with minimal delay. This arbitration strategy also ensures that packets already in the main ring traffic flow are quickly routed to prevent the saturation of the network. Specifically, to enable the transfer, the RSW sets the request signal of the next switch in the ring (by following the travelling direction of the packets), waiting for the acknowledge signal to be set by the peer switch.
-
When the master RSW receives a request from the 2D mesh router to inject packets into the ring, two available VC buffers are used to store the packets temporarily. If space is available in the selected VC buffer, the RSW sets the corresponding acknowledgement signal of the 2D mesh router. The buffers take turns sending out packets under round-robin arbitration to ensure fairness.
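The arbitration rules listed above can be summarised in a short behavioural sketch. It is a simplification under assumptions: the class name, the buffer depth of four packets and the single output port are hypothetical, and the request/acknowledge handshake is reduced to a boolean return value.

```python
from collections import deque
from typing import Deque, List, Optional


class RingSwitchOutput:
    """Hypothetical model of one RSW output: ring traffic first, then VC buffers."""

    def __init__(self, n_vcs: int = 2, depth: int = 4) -> None:
        self.vcs: List[Deque[str]] = [deque() for _ in range(n_vcs)]
        self.depth = depth   # assumed buffer depth, not given in the text
        self.next_vc = 0     # round-robin pointer over the VC buffers

    def inject(self, vc: int, pkt: str) -> bool:
        """Packet from the 2D mesh router; True plays the role of the ack signal."""
        if len(self.vcs[vc]) >= self.depth:
            return False
        self.vcs[vc].append(pkt)
        return True

    def select(self, ring_pkt: Optional[str] = None) -> Optional[str]:
        """Pick the packet forwarded in this cycle."""
        if ring_pkt is not None:
            return ring_pkt  # packets already on the ring have priority
        for k in range(len(self.vcs)):
            i = (self.next_vc + k) % len(self.vcs)
            if self.vcs[i]:
                self.next_vc = (i + 1) % len(self.vcs)
                return self.vcs[i].popleft()
        return None


# Usage: three injected packets and one packet arriving from the ring.
rsw = RingSwitchOutput()
rsw.inject(0, "a1")
rsw.inject(0, "a2")
rsw.inject(1, "b1")
order = [rsw.select(ring_pkt="r"), rsw.select(), rsw.select(), rsw.select(), rsw.select()]
```

The resulting order shows the ring packet leaving first, followed by the injected packets alternating between the two VC buffers.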
5 Simulation results
5.1 Simulation environment
5.2 Evaluation of thread distribution
CreateThread requests, while the blue line shows the effective distribution of threads as they have been scheduled by the \(H(\cdot )\) modules. The high fairness in the assignment of threads to different PEs contributes substantially to the overall performance of the network. Similar results have been obtained by simulating the traffic pattern generated by a block matrix multiplication kernel. A random traffic pattern has also been used to assess NoC throughput and power consumption. Such traffic patterns effectively show the capability of our hashing scheme to balance threads’ requests, thus avoiding overloading particular links.
5.3 Simulating NoC reconfigurability feature
5.4 Evaluation of hybrid NoC topology
5.4.1 Resource utilisation analysis
Router | Core support | LUTs | FFs | BRAMs | Static power (W) | Dynamic power (W) |
---|---|---|---|---|---|---|
2D mesh | 1 | 699 | 572 | 5 | 0.323 | 0.047 |
Proposed mesh | 16 | 1358 | 968 | 8 | 0.324 | 0.075 |
Design | Resource | 16 PEs | 32 PEs | 64 PEs | 128 PEs | 256 PEs | 512 PEs | 1024 PEs |
---|---|---|---|---|---|---|---|---|
Proposed router | LUTs | 0.31 | 0.63 | 1.25 | 2.51 | 5.02 | 10.03 | 20.06 |
Proposed router | FFs | 0.11 | 0.22 | 0.45 | 0.89 | 1.79 | 3.58 | 7.15 |
Proposed router | BRAMs | 0.54 | 1.09 | 2.18 | 4.35 | 8.71 | 17.41 | 34.83 |
Ring switch | LUTs | 0.25 | 0.50 | 0.99 | 1.99 | 3.97 | 7.95 | 15.90 |
Ring switch | FFs | 0.21 | 0.42 | 0.83 | 1.66 | 3.32 | 6.65 | 13.30 |
Ring switch | BRAMs | 2.72 | 5.44 | 10.88 | 21.77 | 43.54 | 87.07 | 174.15 |
Conventional 2D mesh router | LUTs | 2.58 | 2.11 | 4.23 | 20.65 | 41.31 | 82.61 | 165.23 |
Conventional 2D mesh router | FFs | 1.06 | 2.11 | 4.23 | 8.45 | 16.90 | 33.80 | 67.60 |
Conventional 2D mesh router | BRAMs | 5.44 | 10.88 | 21.77 | 43.54 | 87.07 | 174.15 | 348.30 |