ABSTRACT
FPGAs are being deployed in large numbers within cloud computing to provide high-performance, low-power alternatives to more traditional computing structures. While FPGAs offer a number of important benefits to cloud computing environments, they are susceptible to radiation-induced soft errors, which can lead to silent data corruption or system instability. Although soft errors occur infrequently within a single FPGA, they can occur at a relatively high rate across a large-scale FPGA system. This paper investigates the failure rate of several FPGA applications running within an FPGA cloud computing node by performing fault injection experiments to determine the susceptibility of these applications to soft errors. The results of these experiments suggest that silent data corruption will occur every few hours within a 100,000-node FPGA system and that such a system can maintain high levels of reliability only for short periods of operation. These results suggest that soft-error detection and mitigation techniques may be needed in large-scale FPGA systems.
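The scaling argument in the abstract can be illustrated with a short calculation. The sketch below assumes independent, identical nodes with a constant (exponential) failure rate, which is a common simplification and not necessarily the model used in the paper. The per-node rate `ASSUMED_NODE_SDC_FIT` is a hypothetical value chosen only so that the system-level result lands in the "every few hours" range; it is not a measured result. The function names are likewise illustrative.

```python
import math

HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def system_mtbf_hours(node_fit: float, nodes: int) -> float:
    """Mean time between failures for `nodes` independent, identical nodes."""
    system_rate_per_hour = node_fit * nodes / HOURS_PER_FIT_UNIT
    return 1.0 / system_rate_per_hour


def system_reliability(node_fit: float, nodes: int, hours: float) -> float:
    """Probability of zero failures in `hours`, assuming a constant failure rate."""
    system_rate_per_hour = node_fit * nodes / HOURS_PER_FIT_UNIT
    return math.exp(-system_rate_per_hour * hours)


# Hypothetical per-node silent-data-corruption rate, NOT a result from the paper.
ASSUMED_NODE_SDC_FIT = 2000.0
NODES = 100_000

if __name__ == "__main__":
    mtbf = system_mtbf_hours(ASSUMED_NODE_SDC_FIT, NODES)
    print(f"System MTBF: {mtbf:.1f} hours")
    for t in (1, 5, 24):
        r = system_reliability(ASSUMED_NODE_SDC_FIT, NODES, t)
        print(f"R({t} h) = {r:.3f}")
```

Under these assumed numbers the system-level mean time between silent data corruption events is about five hours, and the probability of an error-free run falls off quickly as the run length grows, which is the qualitative behavior the abstract describes.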