Skip to main content
Top
Published in: Computing and Visualization in Science 2-3/2017

22-12-2016

An error-resilient redundant subspace correction method

Authors: Tao Cui, Jinchao Xu, Chen-Song Zhang

Published in: Computing and Visualization in Science | Issue 2-3/2017

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Due to increasing complexity of supercomputers, hard and soft errors are causing more and more problems in high-performance scientific and engineering computation. In order to improve reliability (increase the mean time to failure) of computing systems, a lot of efforts have been devoted to developing techniques to forecast, prevent, and recover from errors at different levels, including architecture, application, and algorithm. In this paper, we focus on algorithmic error resilient iterative solvers and introduce a redundant subspace correction method. Using a general framework of redundant subspace corrections, we construct iterative methods, which have the following properties: (1) maintain convergence when error occurs assuming it is detectable; (2) introduce low computational overhead when no error occurs; (3) require only small amount of point-to-point communication compared to traditional methods and maintain good load balance; (4) improve the mean time to failure. Preliminary numerical experiments demonstrate the efficiency and effectiveness of the new subspace correction method. For simplicity, the main ideas of the proposed framework were demonstrated using the Schwarz methods without a coarse space, which do not scale well in practice.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
The y-axis is processing units and the x-axis is time. The solid bars stand for computational work and springs stand for inter-process communication.
 
Literature
1.
go back to reference Abts, D., Thompson, J., Schwoerer, G.: Architectural Support for Mitigating Dram Soft Errors in Large-Scale Supercomputers. Tech. rep. (2006) Abts, D., Thompson, J., Schwoerer, G.: Architectural Support for Mitigating Dram Soft Errors in Large-Scale Supercomputers. Tech. rep. (2006)
2.
go back to reference Bjorstad, P.E., Skogen, M.: Domain decomposition algorithms of schwarz type, designed for massively parallel computers. In: 5th International Symposium on Domain Decomposition Methods for Partial Differential Equations. SIAM, Philadelphia, pp. 362–375 (1992) Bjorstad, P.E., Skogen, M.: Domain decomposition algorithms of schwarz type, designed for massively parallel computers. In: 5th International Symposium on Domain Decomposition Methods for Partial Differential Equations. SIAM, Philadelphia, pp. 362–375 (1992)
3.
go back to reference Boley, D.L., Brent, R.P., Golub, G.H., Luk, F.T.: Algorithmic fault tolerance using the lanczos method. SIAM J. Matrix Anal. Appl. 13(1), 312–332 (1992)MathSciNetCrossRefMATH Boley, D.L., Brent, R.P., Golub, G.H., Luk, F.T.: Algorithmic fault tolerance using the lanczos method. SIAM J. Matrix Anal. Appl. 13(1), 312–332 (1992)MathSciNetCrossRefMATH
5.
go back to reference Bronevetsky, G., de Supinski, B.R.: Soft error vulnerability of iterative linear algebra methods. In: Proceedings of the 22nd Annual International Conference on Supercomputing, pp. 155–164 (2008) Bronevetsky, G., de Supinski, B.R.: Soft error vulnerability of iterative linear algebra methods. In: Proceedings of the 22nd Annual International Conference on Supercomputing, pp. 155–164 (2008)
6.
go back to reference Chen, Z., Dongarra, J.: Algorithm-based fault tolerance for fail-stop failures. IEEE Trans. Parallel Distrib. Syst. 19(12), 1628–1641 (2008)CrossRef Chen, Z., Dongarra, J.: Algorithm-based fault tolerance for fail-stop failures. IEEE Trans. Parallel Distrib. Syst. 19(12), 1628–1641 (2008)CrossRef
7.
go back to reference Deng, Y.: Applied Parallel Computing. World Scientific, Singapore (2013)MATH Deng, Y.: Applied Parallel Computing. World Scientific, Singapore (2013)MATH
8.
go back to reference Dongarra, J., Beckman, P., Moore, T., Aerts, P., Aloisio, G., Andre, J.C., Barkai, D., Berthou, J.Y., Boku, T., Braunschweig, B., Cappello, F., Chapman, B.: Choudhary, a., Dosanjh, S., Dunning, T., Fiore, S., Geist, a., Gropp, B., Harrison, R., Hereld, M., Heroux, M., Hoisie, a., Hotta, K., Ishikawa, Y., Johnson, F., Kale, S., Kenway, R., Keyes, D., Kramer, B., Labarta, J., Lichnewsky, a., Lippert, T., Lucas, B., Maccabe, B., Matsuoka, S., Messina, P., Michielse, P., Mohr, B., Mueller, M.S., Nagel, W.E., Nakashima, H., Papka, M.E., Reed, D., Sato, M., Seidel, E., Shalf, J., Skinner, D., Snir, M., Sterling, T., Stevens, R., Streitz, F., Sugar, B., Sumimoto, S., Tang, W., Taylor, J., Thakur, R., Trefethen, a., Valero, M., van der Steen, a., Vetter, J., Williams, P., Wisniewski, R., Yelick, K.: The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 25(1), 3–60 (2011). doi:10.1177/1094342010391989 Dongarra, J., Beckman, P., Moore, T., Aerts, P., Aloisio, G., Andre, J.C., Barkai, D., Berthou, J.Y., Boku, T., Braunschweig, B., Cappello, F., Chapman, B.: Choudhary, a., Dosanjh, S., Dunning, T., Fiore, S., Geist, a., Gropp, B., Harrison, R., Hereld, M., Heroux, M., Hoisie, a., Hotta, K., Ishikawa, Y., Johnson, F., Kale, S., Kenway, R., Keyes, D., Kramer, B., Labarta, J., Lichnewsky, a., Lippert, T., Lucas, B., Maccabe, B., Matsuoka, S., Messina, P., Michielse, P., Mohr, B., Mueller, M.S., Nagel, W.E., Nakashima, H., Papka, M.E., Reed, D., Sato, M., Seidel, E., Shalf, J., Skinner, D., Snir, M., Sterling, T., Stevens, R., Streitz, F., Sugar, B., Sumimoto, S., Tang, W., Taylor, J., Thakur, R., Trefethen, a., Valero, M., van der Steen, a., Vetter, J., Williams, P., Wisniewski, R., Yelick, K.: The international exascale software project roadmap. Int. J. High Perform. Comput. Appl. 25(1), 3–60 (2011). doi:10.​1177/​1094342010391989​
9.
go back to reference Dryja, M., Widlund, O.: Some domain decomposition algorithms for elliptic problems. In: Hayes, L., Kincaid, D. (eds.) Iterative Methods for Large Linear Systems, pp. 273–291. Academic Press, San Diego (1989) Dryja, M., Widlund, O.: Some domain decomposition algorithms for elliptic problems. In: Hayes, L., Kincaid, D. (eds.) Iterative Methods for Large Linear Systems, pp. 273–291. Academic Press, San Diego (1989)
10.
go back to reference Dryja, M., Widlund, O.B.: Additive schwarz methods for elliptic finite element problems in three dimensions. In: Fifth Conference on Domain Decomposition Methods for Partial Differential Equations, Philadelphia, PA (1992) Dryja, M., Widlund, O.B.: Additive schwarz methods for elliptic finite element problems in three dimensions. In: Fifth Conference on Domain Decomposition Methods for Partial Differential Equations, Philadelphia, PA (1992)
11.
go back to reference Du, P., Luszczek, P., Dongarra, J.: High performance dense linear system solver with resilience to multiple soft errors. In: International Conference on Cluster Computing, pp. 272–280 (2011) Du, P., Luszczek, P., Dongarra, J.: High performance dense linear system solver with resilience to multiple soft errors. In: International Conference on Cluster Computing, pp. 272–280 (2011)
13.
go back to reference Gropp, W.D.: Parallel computing and domain decomposition. In: Fifth Conference on Domain Decomposition Methods for Partial Differential Equations, pp. 349–361 (1992) Gropp, W.D.: Parallel computing and domain decomposition. In: Fifth Conference on Domain Decomposition Methods for Partial Differential Equations, pp. 349–361 (1992)
14.
go back to reference Hackbusch, W.: Elliptic Differential Equations: Theory and Numerical Treatment, Computational Mathematics Series. Springer, Berlin (1992)CrossRef Hackbusch, W.: Elliptic Differential Equations: Theory and Numerical Treatment, Computational Mathematics Series. Springer, Berlin (1992)CrossRef
15.
go back to reference Hackbusch, W.: Iterative Solution of Large Sparse Systems of Equations, Applied Mathematical Sciences, vol. 95. Springer, New York (1994)CrossRefMATH Hackbusch, W.: Iterative Solution of Large Sparse Systems of Equations, Applied Mathematical Sciences, vol. 95. Springer, New York (1994)CrossRefMATH
16.
go back to reference Hoemmen, M., Heroux, M.A.: Fault-tolerant iterative methods via selective reliability. In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2011) Hoemmen, M., Heroux, M.A.: Fault-tolerant iterative methods via selective reliability. In: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC) (2011)
17.
go back to reference Huang, K.h., Abraham, J.A.: Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. c(6), 518–528 (1984) Huang, K.h., Abraham, J.A.: Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput. c(6), 518–528 (1984)
18.
go back to reference Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)MathSciNetCrossRefMATH Karypis, G., Kumar, V.: A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput. 20(1), 359–392 (1998)MathSciNetCrossRefMATH
20.
go back to reference Kikuchi, N.: Finite Element Methods in Mechanics. Cambridge University Press, Cambridge (1986)CrossRefMATH Kikuchi, N.: Finite Element Methods in Mechanics. Cambridge University Press, Cambridge (1986)CrossRefMATH
21.
go back to reference Langou, J., Chen, Z., Bosilca, G., Dongarra, J.: Recovery patterns for iterative methods in a parallel unstable environment. SIAM J. Sci. Comput. 30(1), 102–116 (2007)MathSciNetCrossRefMATH Langou, J., Chen, Z., Bosilca, G., Dongarra, J.: Recovery patterns for iterative methods in a parallel unstable environment. SIAM J. Sci. Comput. 30(1), 102–116 (2007)MathSciNetCrossRefMATH
22.
go back to reference Laprie, J.: Dependable computing: Concepts, limits, challenges. In: The 25th IEEE International Symposium on Fault-Tolerant Computing, pp. 42–54 (1995) Laprie, J.: Dependable computing: Concepts, limits, challenges. In: The 25th IEEE International Symposium on Fault-Tolerant Computing, pp. 42–54 (1995)
23.
go back to reference Liu, Y., Nassar, R., Leangsuksun, C.B., Naksinehaboon, N., Paun, M., Scott, S.L.: An optimal checkpoint/restart model for a large scale high performance computing system. In: 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–9 (2008). doi:10.1109/IPDPS.2008.4536279 Liu, Y., Nassar, R., Leangsuksun, C.B., Naksinehaboon, N., Paun, M., Scott, S.L.: An optimal checkpoint/restart model for a large scale high performance computing system. In: 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1–9 (2008). doi:10.​1109/​IPDPS.​2008.​4536279
24.
go back to reference Luk, F., Park, H.: An analysis of algorithm-based fault tolerance techniques. In: 30th Annual Technical Symposium on International Society for Optics and Photonics, pp. 172–184 (1986) Luk, F., Park, H.: An analysis of algorithm-based fault tolerance techniques. In: 30th Annual Technical Symposium on International Society for Optics and Photonics, pp. 172–184 (1986)
25.
go back to reference Malkowski, K., Raghavan, P., Kandemir, M.: Analyzing the soft error resilience of linear solvers on multicore multiprocessors. In: 2010 IEEE International Symposium on Parallel & Distributed Processing, pp. 1–12 (2010). doi:10.1109/IPDPS.2010.5470411 Malkowski, K., Raghavan, P., Kandemir, M.: Analyzing the soft error resilience of linear solvers on multicore multiprocessors. In: 2010 IEEE International Symposium on Parallel & Distributed Processing, pp. 1–12 (2010). doi:10.​1109/​IPDPS.​2010.​5470411
26.
go back to reference Michalak, S., Harris, K., Hengartner, N., Takala, B., Wender, S.: Predicting the number of fatal soft errors in Los Alamos national laboratory’s ASC Q supercomputer. IEEE Trans. Device Mater. Reliab. 5(3), 329–335 (2005). doi:10.1109/TDMR.2005.855685 CrossRef Michalak, S., Harris, K., Hengartner, N., Takala, B., Wender, S.: Predicting the number of fatal soft errors in Los Alamos national laboratory’s ASC Q supercomputer. IEEE Trans. Device Mater. Reliab. 5(3), 329–335 (2005). doi:10.​1109/​TDMR.​2005.​855685 CrossRef
27.
go back to reference Miskov-Zivanov, N., Marculescu, D.: Soft error rate analysis for sequential circuits. In: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 1436–1441 (2007) Miskov-Zivanov, N., Marculescu, D.: Soft error rate analysis for sequential circuits. In: Proceedings of the Conference on Design, Automation and Test in Europe, pp. 1436–1441 (2007)
28.
go back to reference Monk, P.: Finite Element Methods for Maxwell’s Equations. Numerical Mathematics and Scientific Computation. Clarendon Press, Oxford (2003)CrossRef Monk, P.: Finite Element Methods for Maxwell’s Equations. Numerical Mathematics and Scientific Computation. Clarendon Press, Oxford (2003)CrossRef
29.
go back to reference Mukherjee, S., Emer, J., Reinhardt, S.K.: The soft error problem: An architectural perspective. In: Proc. 11th Int’l Symp. on High-Performance Computer Architecture (HPCA) (2005) Mukherjee, S., Emer, J., Reinhardt, S.K.: The soft error problem: An architectural perspective. In: Proc. 11th Int’l Symp. on High-Performance Computer Architecture (HPCA) (2005)
31.
go back to reference Plank, J.S., Li, K., Puening, M.A.: Diskless checkpointing. IEEE Trans. Parallel Distrib. Syst. 9(10), 972–986 (1998)CrossRef Plank, J.S., Li, K., Puening, M.A.: Diskless checkpointing. IEEE Trans. Parallel Distrib. Syst. 9(10), 972–986 (1998)CrossRef
32.
go back to reference Reddi, V.: Hardware and software co-design for robust and resilient execution. In: 2012 International Conference on Collaboration Technologies and Systems, p. 380 (2012) Reddi, V.: Hardware and software co-design for robust and resilient execution. In: 2012 International Conference on Collaboration Technologies and Systems, p. 380 (2012)
33.
go back to reference Roy-Chowdhury, A., Banerjee, P.: A fault-tolerant parallel algorithm for iterative solution of the laplace equation. In: International Conference on Parallel Processing, vol. 3, pp. 133–140 (1993) Roy-Chowdhury, A., Banerjee, P.: A fault-tolerant parallel algorithm for iterative solution of the laplace equation. In: International Conference on Parallel Processing, vol. 3, pp. 133–140 (1993)
35.
go back to reference Shantharam, M., Srinivasmurthy, S., Raghavan, P.: Characterizing the impact of soft errors on iterative methods in scientific computing. In: Proceedings of the International Conference on Supercomputing, pp. 152–161 (2011) Shantharam, M., Srinivasmurthy, S., Raghavan, P.: Characterizing the impact of soft errors on iterative methods in scientific computing. In: Proceedings of the International Conference on Supercomputing, pp. 152–161 (2011)
36.
go back to reference Shantharam, M., Srinivasmurthy, S., Raghavan, P.: Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In: Proceedings of the 26th ACM International Conference on Supercomputing, pp. 69–78. ACM, New York (2012) Shantharam, M., Srinivasmurthy, S., Raghavan, P.: Fault tolerant preconditioned conjugate gradient for sparse linear system solution. In: Proceedings of the 26th ACM International Conference on Supercomputing, pp. 69–78. ACM, New York (2012)
37.
go back to reference Smith, B.F.: A parallel implementation of an iterative substructuring algorithm for problems in three dimensions. SIAM J. Sci. Comput. 14(2), 406–423 (1993)MathSciNetCrossRefMATH Smith, B.F.: A parallel implementation of an iterative substructuring algorithm for problems in three dimensions. SIAM J. Sci. Comput. 14(2), 406–423 (1993)MathSciNetCrossRefMATH
38.
go back to reference Stoyanov, M.K., Webster, C.G.: Numerical Analysis of Fixed Point Algorithms in the Presence of Hardware Faults. Tech. rep., Oak Ridge National Laboratory (ORNL) (2013) Stoyanov, M.K., Webster, C.G.: Numerical Analysis of Fixed Point Algorithms in the Presence of Hardware Faults. Tech. rep., Oak Ridge National Laboratory (ORNL) (2013)
39.
go back to reference Toselli, A., Widlund, O.B.: Domain Decomposition Methods: Algorithms and Theory, Springer Series in Computational Mathematics, vol. 34. Springer, Berlin (2005)CrossRefMATH Toselli, A., Widlund, O.B.: Domain Decomposition Methods: Algorithms and Theory, Springer Series in Computational Mathematics, vol. 34. Springer, Berlin (2005)CrossRefMATH
40.
go back to reference Treaster, M.: A Survey of Fault-tolerance and Fault-recovery techniques in Parallel Systems. Tech. rep., ACM Computing Research Repository (2005) Treaster, M.: A Survey of Fault-tolerance and Fault-recovery techniques in Parallel Systems. Tech. rep., ACM Computing Research Repository (2005)
43.
go back to reference Zhang, W.: Computing cache vulnerability to transient errors and its implication. In: 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, pp. 427–435 (2005) Zhang, W.: Computing cache vulnerability to transient errors and its implication. In: 20th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, pp. 427–435 (2005)
Metadata
Title
An error-resilient redundant subspace correction method
Authors
Tao Cui
Jinchao Xu
Chen-Song Zhang
Publication date
22-12-2016
Publisher
Springer Berlin Heidelberg
Published in
Computing and Visualization in Science / Issue 2-3/2017
Print ISSN: 1432-9360
Electronic ISSN: 1433-0369
DOI
https://doi.org/10.1007/s00791-016-0270-6

Other articles of this Issue 2-3/2017

Computing and Visualization in Science 2-3/2017 Go to the issue

Premium Partner