Impact of Soft Errors on Large-Scale FPGA Cloud Computing

Published: 20 February 2019

ABSTRACT

FPGAs are being used in large numbers within cloud computing to provide high-performance, low-power alternatives to more traditional computing structures. While FPGAs provide a number of important benefits to cloud computing environments, they are susceptible to radiation-induced soft errors, which can lead to silent data corruption or system instability. Although soft errors within a single FPGA occur infrequently, soft errors in large-scale FPGA systems can occur at a relatively high rate. This paper investigates the failure rate of several FPGA applications running within an FPGA cloud computing node by performing fault injection experiments to determine the susceptibility of these applications to soft errors. The results from these experiments suggest that silent data corruption will occur every few hours within a 100,000-node FPGA system and that such a system can only maintain high levels of reliability for short periods of operation. These results suggest that soft-error detection and mitigation techniques may be needed in large-scale FPGA systems.
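
The scaling argument in the abstract can be illustrated with a short calculation. The sketch below assumes that silent data corruption events arrive independently and at a constant rate on every device (an exponential failure model), so the system-level rate is the per-device rate multiplied by the node count. The function names and the per-device rate of 1,000 FIT are hypothetical placeholders chosen for illustration; they are not measurements from the paper.

```python
import math

# FIT = failures in time: expected failures per 10^9 device-hours.
HOURS_PER_FIT_UNIT = 1e9


def system_rate_per_hour(fit_per_device: float, num_devices: int) -> float:
    """Failure rate of the whole system, assuming independent devices
    that each fail at a constant rate."""
    return fit_per_device * num_devices / HOURS_PER_FIT_UNIT


def system_mttf_hours(fit_per_device: float, num_devices: int) -> float:
    """Mean time between system-level failures, in hours."""
    return 1.0 / system_rate_per_hour(fit_per_device, num_devices)


def system_reliability(fit_per_device: float, num_devices: int, hours: float) -> float:
    """Probability that no device fails during `hours` of operation:
    R(t) = exp(-N * lambda * t)."""
    return math.exp(-system_rate_per_hour(fit_per_device, num_devices) * hours)


if __name__ == "__main__":
    ASSUMED_FIT = 1000.0  # hypothetical per-device rate, NOT taken from the paper
    NODES = 100_000       # system size mentioned in the abstract

    print("single device MTTF (hours):", system_mttf_hours(ASSUMED_FIT, 1))
    print("system MTTF (hours):", system_mttf_hours(ASSUMED_FIT, NODES))
    for t in (1, 8, 24):
        r = system_reliability(ASSUMED_FIT, NODES, t)
        print(f"reliability over {t} h: {r:.4f}")
```

Under these assumptions the per-device mean time to failure shrinks by a factor equal to the node count, and the reliability of the whole system decays exponentially with operating time. This is why a failure that is rare on one device can become routine at scale.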

Published in

FPGA '19: Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
February 2019
360 pages
ISBN: 9781450361378
DOI: 10.1145/3289602

Copyright © 2019 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 125 of 627 submissions, 20%