Published in: The Journal of Supercomputing 4/2016

01-04-2016

Soft error resilience in Big Data kernels through modular analysis

Authors: Sui Chen, Greg Bronevetsky, Lu Peng, Bin Li, Xin Fu

Abstract

The shrinking feature sizes and operating voltages of processor circuits are making them increasingly vulnerable to soft faults, which calls for fault resilience techniques at both the software and hardware levels in the big data context. To assist software developers in writing fault-resilient big data applications, we propose the tool ErrorSight, which helps them focus their efforts on the code regions and data structures that are most vulnerable to soft errors, understand how numerical errors propagate through the program, and apply fault resilience techniques effectively. ErrorSight achieves this by efficiently generating error profiles, leveraging the predictive power of the Boosted Regression Tree model. We use four big data kernels to illustrate the modular analysis mechanism of ErrorSight and show its usefulness in developing numerical fault resilience for big data applications.
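To make the notion of a soft error in numerical data concrete, the following is a minimal sketch of a single bit flip in an IEEE 754 double and the numerical error it produces. It is not ErrorSight's injection mechanism, which the abstract does not describe; the helper names `flip_bit` and `relative_error` are hypothetical and chosen only for illustration.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding flipped.

    Hypothetical helper: bit 0 is the least significant mantissa bit,
    bits 52-62 are the exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as a 64-bit unsigned integer.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    # Flip the chosen bit and reinterpret the result as a double.
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def relative_error(exact: float, observed: float) -> float:
    """Relative deviation of `observed` from a nonzero `exact` value."""
    return abs(observed - exact) / abs(exact)


if __name__ == "__main__":
    x = 1.0
    # A low mantissa bit, the lowest exponent bit, and the sign bit.
    for bit in (0, 52, 63):
        y = flip_bit(x, bit)
        print(f"bit {bit:2d}: {x} -> {y!r}, relative error {relative_error(x, y):.3e}")
```

The same single-bit fault can be nearly harmless (a low mantissa bit) or can halve or negate the value (an exponent or sign bit), which is why the location of a fault matters when deciding which data structures to protect.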

Metadata
Title: Soft error resilience in Big Data kernels through modular analysis
Authors: Sui Chen, Greg Bronevetsky, Lu Peng, Bin Li, Xin Fu
Publication date: 01-04-2016
Publisher: Springer US
Published in: The Journal of Supercomputing, Issue 4/2016
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI: https://doi.org/10.1007/s11227-016-1682-2
