Published in: The Journal of Supercomputing 1/2013

1 October 2013

Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments

Written by: Dawei Sun, Guiran Chang, Changsheng Miao, Xingwei Wang

Abstract

Failures are normal rather than exceptional in cloud computing environments. Fault tolerance plays a key role in ensuring cloud serviceability, and the difficulty of achieving high fault tolerance is one of the major obstacles to a new era of high-serviceability cloud computing. Fault-tolerant service is an essential part of Service Level Objectives (SLOs) in clouds. To achieve a high level of cloud serviceability and to meet demanding cloud SLOs, a foolproof fault tolerance strategy is needed. In this paper, the definitions of fault, error, and failure in a cloud are given, and the principles for high fault tolerance objectives are systematically analyzed with reference to fault tolerance theories suited to large-scale distributed computing environments. Based on the principles and semantics of cloud fault tolerance, a dynamic adaptive fault tolerance strategy, DAFT, is put forward. It includes: (i) analyzing the mathematical relationship between different failure rates and two fault tolerance strategies, namely checkpointing and data replication; (ii) building a dynamic adaptive checkpointing fault tolerance model and a dynamic adaptive replication fault tolerance model, and combining the two models to maximize serviceability and meet the SLOs; and (iii) evaluating the dynamic adaptive fault tolerance strategy under various conditions in large-scale cloud data centers, considering different system-centric parameters such as fault tolerance degree, fault tolerance overhead, and response time. Theoretical as well as experimental results conclusively demonstrate that DAFT has high potential, as it provides efficient fault tolerance enhancements, significant improvement in cloud serviceability, and high SLO satisfaction. It efficiently and effectively achieves a trade-off among fault tolerance objectives in cloud computing environments.
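
To make point (i) concrete, the following is a minimal sketch of how a failure rate can drive the two strategies named in the abstract. It does not reproduce the DAFT model, which is not given on this page. Instead it assumes two textbook approximations: Young's first-order formula for the checkpoint interval, and independent replica failures for the replica count. The function and parameter names are hypothetical.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s: float, failure_rate_per_s: float) -> float:
    """Checkpoint interval from Young's first-order approximation.

    interval = sqrt(2 * C / lambda), where C is the time to write one
    checkpoint and lambda is the failure rate (1 / MTBF).
    This is a stand-in for illustration, not the paper's model.
    """
    if checkpoint_cost_s <= 0 or failure_rate_per_s <= 0:
        raise ValueError("checkpoint cost and failure rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s / failure_rate_per_s)


def replicas_needed(node_availability: float, target_availability: float) -> int:
    """Smallest replica count n with 1 - (1 - a)^n >= target.

    Assumes replicas fail independently, which real data centers only
    approximate (correlated rack or power failures break this).
    """
    if not (0.0 < node_availability < 1.0 and 0.0 < target_availability < 1.0):
        raise ValueError("availabilities must lie strictly between 0 and 1")
    n = 1
    # Add replicas until at least one is expected to be up often enough.
    while 1.0 - (1.0 - node_availability) ** n < target_availability:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical inputs: 60 s per checkpoint, one failure per day on average.
    interval = young_checkpoint_interval(60.0, 1.0 / 86400.0)
    print(f"checkpoint roughly every {interval:.0f} s")
    print("replicas for 99.95% target:", replicas_needed(0.9, 0.9995))
```

Under these assumptions a higher failure rate shortens the checkpoint interval, and a lower per-node availability raises the replica count, which is the kind of relationship an adaptive strategy would track at run time.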

Metadata
Title
Analyzing, modeling and evaluating dynamic adaptive fault tolerance strategies in cloud computing environments
Authors
Dawei Sun
Guiran Chang
Changsheng Miao
Xingwei Wang
Publication date
1 October 2013
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 1/2013
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-013-0898-7
