Adaptive fault-tolerant architecture and routing algorithm for reliable many-core 3D-NoC systems

https://doi.org/10.1016/j.jpdc.2016.03.014

Highlights

  • Adaptive fault-tolerant 3D-Network-on-Chip system architecture.

  • RAB mechanism for deadlock recovery and fault-tolerance in input-buffers.

  • Traffic-Prediction-Unit technique for congestion relief.

  • Bypass-Link-on-Demand to tackle fault-occurrence in the Crossbar.

  • Fault-tolerance and graceful performance degradation obtained at high fault-rates.

Abstract

During the last few decades, Three-dimensional Networks-on-Chip (3D-NoCs) have been showing their advantages over 2D-NoC architectures. This is thanks to the reduced average interconnect length and lower interconnect-power consumption inherited from Three-dimensional Integrated Circuits (3D-ICs). On the other hand, questions about their reliability are starting to arise. This issue is mainly caused by their complex nature, where a single faulty transistor may cause intolerable performance degradation or even the collapse of the entire system. To ensure their correct functionality, 3D-NoC systems must tolerate any short-term malfunction or permanent physical damage, delivering messages on time while minimizing the performance degradation as much as possible.

In this paper, we present a fault-tolerant 3D-NoC architecture, called 3D-Fault-Tolerant-OASIS (3D-FTO).1 With the aid of a light-weight routing algorithm, 3D-FTO manages to avoid system failure in the presence of a large number of transient, intermittent, and permanent faults. Moreover, the proposed architecture leverages reconfigurable components to handle faults in the links, input-buffers, and crossbar, where faults are most likely to occur. The proposed 3D-FTO system is able to work around different kinds of faults, ensuring graceful performance degradation while minimizing the additional hardware complexity and remaining power-efficient.

Introduction

During the past few decades, much research has focused on Three-dimensional Networks-on-Chip (3D-NoCs) [28], [17], [2] as a promising solution to alleviate the interconnect bottleneck and reduce the power consumption in current System-on-Chip (SoC) designs. As 3D-NoC architectures started to show their performance benefits and energy efficiency against 2D-NoC systems [3], [8], questions about their reliability to sustain their performance growth began to arise [9]. This is mainly due to challenges inherited from both Three-dimensional Integrated-Circuits (3D-ICs) and NoCs. On one side are the complex nature of 3D-IC fabrics and the continuing shrinkage of semiconductor components, together with the significant heterogeneity of 3D chips, which are more likely to mix logic layers with memory layers, adding more complexity and increasing the fault probability in a system [22]. On the other side, the single-point-of-failure nature of NoCs is a major concern for reliability, as they are the sole communication medium.

As a result, 3D-NoC systems are becoming susceptible to a variety of faults caused by crosstalk [15], the impact of radiation [10], oxide breakdown [19], and so on [25]. A simple failure in a single transistor caused by one of these factors may compromise the reliability of the entire system, where the failure can manifest as corrupted message delivery, missed timing requirements, or sometimes even the collapse of the entire system.

Faults can occur in any component of a 3D-NoC system (e.g., link, router, buffers, crossbar). Their rate of occurrence depends on the design, technology, environment, and operating conditions. From a time perspective, the duration of faults is very important, especially for real-time 3D-NoC systems, and faults can be categorized into three main types [11]: (1) Transient faults: they occur and remain in the system for a particular period of time before disappearing; (2) Intermittent faults: they are transient faults which occur from time to time; (3) Permanent faults: they start at a particular time and remain in the system until they are repaired. From a locality perspective, it is also important to analyze the behavior of faults in the different components of the system to find the ones where faults are most likely to occur. According to [16], the input-buffers and crossbar occupy the largest area in a 3D-NoC system, reaching up to 80% and 10%, respectively, while none of the remaining components exceeds 3% of the total router area. Because they consume the largest portion of the router area, the input-buffers and crossbar have a very high fault occurrence probability if we assume that the fault distribution is proportional to the area distribution. Therefore, adopting fault-tolerance only for inter-router links (as in most 3D-NoC systems) is not enough to build a reliable system, and fault handling should also cover the buffers and crossbar.
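The area argument can be made concrete with a short sketch. Under the stated assumption that faults are distributed in proportion to area, the fraction of faults landing in a component equals its share of the router area. The 80% and 10% figures are the upper bounds quoted above; the four remaining component names and their equal 2.5% shares are hypothetical placeholders, chosen only so that each stays below the 3% bound. The function names are likewise illustrative.

```python
import random

# Area shares of one router. The input-buffer (80%) and crossbar (10%) values
# are the upper bounds quoted in the text. The four remaining entries are
# hypothetical placeholders, each kept below the 3% bound.
AREA_SHARE = {
    "input_buffers": 0.80,
    "crossbar": 0.10,
    "arbiter": 0.025,        # hypothetical
    "routing_logic": 0.025,  # hypothetical
    "link_drivers": 0.025,   # hypothetical
    "control": 0.025,        # hypothetical
}


def fault_share(area_share):
    """Normalize area shares into per-component fault probabilities."""
    total = sum(area_share.values())
    return {name: area / total for name, area in area_share.items()}


def sample_fault_sites(area_share, n_faults, seed=0):
    """Draw fault locations with probability proportional to area."""
    rng = random.Random(seed)
    names = list(area_share)
    weights = [area_share[name] for name in names]
    return rng.choices(names, weights=weights, k=n_faults)


shares = fault_share(AREA_SHARE)
covered = shares["input_buffers"] + shares["crossbar"]
print(f"faults expected in buffers + crossbar: {covered:.0%}")
```

With these numbers, roughly nine faults out of ten would fall in the input-buffers or crossbar, which is the reasoning behind extending fault-tolerance beyond the links.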

In [7], we presented a routing algorithm called Hybrid-Look-Ahead Fault-Tolerant (HLAFT). HLAFT takes advantage of the high throughput and low overhead of the previously presented Look-Ahead-Fault-Tolerant routing (LAFT) [5] and combines it with local routing for better routing decisions, while simultaneously guaranteeing fault-tolerance with graceful performance degradation. HLAFT also solves the deadlock problem with the aid of the Random-Access-Buffer (RAB) mechanism [7], [6], which showed its capability of detecting and recovering from deadlock at a very low additional hardware complexity. HLAFT was implemented on the 3D-OASIS-NoC system [5] (whose baseline router architecture is depicted in Fig. 1) and showed its capability to reduce the communication latency and to recover from deadlock. Despite the advantages gained from these techniques, the fault-tolerance awareness is limited to inter- and intra-layer links and does not consider other components of the router, such as the input-buffers and crossbar. These limitations diminish the reliability of our system.

Motivated by the observations above, we propose in this paper 3D-Fault-Tolerant-OASIS (3D-FTO), a robust fault-tolerant 3D-NoC system that leverages reconfigurable components. The proposed system handles a large number of transient, intermittent, and permanent faults in the input-buffers, crossbar, and links (which are the components most susceptible to faults in 3D-NoC systems) by exploiting the inherent structural redundancy in the architecture to work around errors. Contrary to previous works, the proposed system tolerates multiple faults in a single crossbar with no considerable performance degradation. In addition, the routing algorithm used is always minimal (as long as at least one minimal path exists) and, with the aid of RAB, both fault-tolerance and deadlock-freedom are ensured with no significant area or power overhead. To the best of our knowledge, none of the previously proposed 3D-NoC architectures has dealt with fault-tolerance covering all three of these components under different kinds of faults. The main contributions of this work are the following:

  • Adaptive router architecture that reconfigures the components most susceptible to faults using redundant resources, to ensure the correct functionality of our system even at high fault-rates:

    • Input-buffer: To counter faults in the input-buffer, a smart buffering mechanism for deadlock-recovery, named Random-Access-Buffer (RAB) [7], [6], was extended and augmented with a Traffic-Prediction-Unit (TPU) to tolerate faults in the input-buffer slots.

    • Crossbar: We employed the Bypass-Link-on-Demand (BLoD) approach, which provides the appropriate and minimal bypass channels as alternative escapes whenever the baseline crossbar channels are detected as faulty.

  • Routing: To address link faults, a graceful fault-tolerant routing algorithm, named Look-Ahead-Fault-Tolerant (LAFT) [5], was optimized in the proposed architecture to mitigate the different kinds of link faults. LAFT takes advantage of look-ahead routing to boost the performance of 3D-NoCs while ensuring link fault-tolerance and minimizing the additional hardware. Moreover, when errors cannot be contained within a single router (the entire input-buffer or crossbar is declared faulty), LAFT is invoked to declare the router faulty, and routing is then reconfigured to bypass it and avoid any information loss.

  • Evaluation: The proposed architecture was prototyped on FPGA and evaluated with different parallel large benchmarks and traffic patterns. Evaluation results and analysis are provided to show the benefits gained with the proposed architecture.

The rest of the paper is organized as follows: in Section 2 we present some of the related work on fault-tolerant routing architectures and algorithms used in the most popular 2D- and 3D-NoC systems. The proposed 3D-FTO router architecture, including the Bypass-Link-on-Demand (BLoD), the Random-Access-Buffer (RAB) mechanism, and the Traffic-Prediction-Unit (TPU), is explained in Section 3. Section 4 gives a brief overview of the Look-Ahead-Fault-Tolerant routing algorithm and the main modifications made to adapt it to the proposed architecture. Section 5 is dedicated to the evaluation methodology and results, and Section 6 concludes the paper and outlines future work.

Section snippets

Related work

Many works have tackled fault-tolerance in NoC systems (2D and 3D); they can be classified by the target system, the fault type, or the fault-handling mechanism (e.g., using routing algorithms or architectural solutions). We previously presented in [7], [5] some of the well-known routing algorithms used in 3D-NoC systems that focus mainly on link failure. Another interesting work, presented by Radetzki et al. [25], gives a survey of the different

Router architecture

Fig. 2 depicts the high-level representation of the 3D-Fault-Tolerant-OASIS (3D-FTO) baseline router (in white) together with the components added for fault-tolerance (colored). The 3D-FTO router relies on simple recovery techniques based on system reconfiguration with redundant structural resources to contain faults (in the input-buffers, crossbar, and links) and to prevent system failure and information corruption or loss.

As shown in Fig. 2, 3D-FTO router contains seven

Look-Ahead-Fault-Tolerant routing algorithm

To keep the benefits of look-ahead routing  [2], [4], Look-Ahead-Fault-Tolerant routing algorithm (LAFT)  [5] should be able to perform the routing decision for the next node taking into consideration its link status and select the best minimal path. Before starting to explain LAFT, there are two critical assumptions that should be mentioned. First, the links connecting the PE to the local input and output ports are always nonfaulty. Second, we assume that there exists at least one minimal path
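As a rough illustration of a look-ahead decision that takes link status into account, the sketch below computes, for the next node on a packet's path, which minimal output directions remain usable. It is a simplified stand-in and not the LAFT algorithm itself: the 3D-mesh coordinate convention, the direction names, the `faulty_links` set, and the function names are all assumptions made for this example, and picking the first usable candidate is a placeholder for however LAFT selects the best minimal path.

```python
# Hypothetical direction names for a 3D mesh: (positive, negative) per axis.
AXIS_DIRECTIONS = (("east", "west"), ("north", "south"), ("up", "down"))


def minimal_directions(node, dst):
    """Return the directions that move one hop closer to dst from node."""
    directions = []
    for axis, (positive, negative) in enumerate(AXIS_DIRECTIONS):
        if dst[axis] > node[axis]:
            directions.append(positive)
        elif dst[axis] < node[axis]:
            directions.append(negative)
    return directions


def look_ahead_direction(next_node, dst, faulty_links):
    """Choose the output direction to use at next_node, one hop in advance.

    faulty_links is a set of (node, direction) pairs marking broken output
    links. Returns "local" at the destination, or None when every minimal
    link is faulty (a case excluded by the second assumption in the text).
    """
    if next_node == dst:
        return "local"  # the PE links are assumed to be fault-free
    candidates = [
        direction
        for direction in minimal_directions(next_node, dst)
        if (next_node, direction) not in faulty_links
    ]
    # Placeholder tie-break: take the first usable minimal direction.
    return candidates[0] if candidates else None


faulty = {((0, 0, 0), "east")}
choice = look_ahead_direction((0, 0, 0), (1, 1, 0), faulty)
print(choice)
```

In this example the east link of node (0, 0, 0) is marked faulty, so the remaining minimal direction toward (1, 1, 0) is selected instead.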

Evaluation methodology

Our proposed 3D-Fault-Tolerant-OASIS (3D-FTO) system was designed on hardware, synthesized and prototyped on commercial CAD tools and FPGA board, respectively  [1]. We evaluate the hardware complexity of the router which 3D-FTO is based upon in terms of area utilization, power consumption (static and dynamic) and speed. To evaluate the performance of the proposed system, we selected Matrix-multiplication  [12], [29] and JPEG-encoder  [20] as real benchmarks and also two traffic patterns:

Conclusion and future work

In this paper, we present a fault-tolerant 3D-NoC architecture, called 3D-Fault-Tolerant-OASIS (3D-FTO). 3D-FTO manages to avoid the system failure in the presence of a large number of faults while ensuring graceful performance degradation and minimizing the additional hardware complexity and remaining power-efficient. In addition to Look-Ahead-Fault-Tolerant (LAFT) routing algorithm previously presented to tackle the faulty-links problem, the proposed architecture is leveraging on


References (29)

  • A. Ben Ahmed et al., Graceful Deadlock-Free Fault-Tolerant routing algorithm for 3D network-on-chip architectures, J. Parallel Distrib. Comput. (2014)
  • A. Ben Ahmed, A. Ben Abdallah, LA-XYZ: low latency, high throughput look-ahead routing algorithm for 3D Network-on-Chip...
  • A. Ben Abdallah, M. Sowa, Basic network-on-chip interconnection for future gigascale MCSoCs applications: Communication...
  • A. Ben Ahmed, A. Ben Abdallah, Low-overhead routing algorithm for 3D network-on-chip, in: IEEE Proceedings of The Third...
  • A. Ben Ahmed et al., Architecture and Design of High-throughput, low-latency, and Fault-Tolerant routing algorithm for 3D-Network-on-Chip (3D-NoC), J. Supercomput. (2013)
  • A. Ben Ahmed, A. Ben Abdallah, Fault-tolerant routing algorithm with deadlock recovery support for 3D-NoC...
  • A. Ben Ahmed, A. Ben Abdallah, K. Kuroda, Architecture and design of efficient 3D network-on-chip (3D NoC) for custom...
  • L. Benini et al., Networks on Chips: Technology and Tools (2006)
  • S. Borkar, Designing reliable systems from unreliable components: The challenges of transistor variability and degradation, IEEE Micro (2005)
  • A. Burns et al., Real-Time Systems and Programming Languages: Ada, Real-Time Java and C/Real-Time POSIX (2009)
  • P. Chan, K. Dai, D. Wu, J. Rao, X. Zou, The parallel algorithm implementation of matrix multiplication based on ESCA,...
  • A.A. Chien et al., Planar-adaptive routing: Low-cost adaptive networks for multiprocessors, J. ACM (1995)
  • K. Constantinides, et al. BulletProof: A defect-tolerant CMP switch architecture, in: Proc. of the 12th Int. Symp. on...

    Akram Ben Ahmed received his M.S.E. and Ph.D. degrees in Computer Science and Engineering from the University of Aizu, Japan, in 2012 and 2015, respectively. He is currently a postdoctoral researcher in the Department of Information and Computer Science, Keio University, Japan. His current research interests include on-chip interconnection networks, reliable and fault-tolerant systems, and ultra-low-power embedded real-time systems.

    Abderazek Ben Abdallah is a full Professor at the University of Aizu, Fukushima, Japan, and the Head of the Division of computer engineering since April 2014. He joined the school of computer science and engineering at the University of Aizu in 2007 after serving on the faculty of the graduate school of information systems, the University of Electro-Communications, Tokyo, from 2002 to 2007. He received the B.E. and M.E. degrees in electrical engineering and computer engineering from Huazhong University of Science and Technology in 1994 and 1997 respectively. He received the Ph.D. degree in computer engineering from the University of Electro-Communications, Tokyo, in 2002. His general area of research lies in energy-efficient reliable system design and adaptive multicore system-on-chip design with on-chip learning and cognitive capabilities. Dr. Ben Abdallah is a senior member of IEEE, and a member of ACM and IEICE.

    1

    This project is partially supported by Competitive research funding, Ref. P1-5, Fukushima, Japan.
