Published in: Cluster Computing 4/2014

01-12-2014

An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems

Authors: Xiaoyong Tang, Kenli Li, Guiping Liao


Abstract

In large-scale heterogeneous cluster computing systems, processor and network failures are inevitable and can adversely affect the applications running on them. One way to take failures into account is to employ a reliability-aware scheduling algorithm. However, most existing scheduling algorithms for precedence-constrained tasks in heterogeneous systems consider only schedule length and do not efficiently satisfy the reliability requirements of tasks. To address this problem, we build an application reliability analysis model based on the Weibull distribution, which can dynamically measure the reliability of tasks executing on a heterogeneous cluster with an arbitrary network architecture. We then propose a reliability-driven earliest finish time with duplication scheduling algorithm (REFTD), which incorporates task reliability overhead into scheduling. Furthermore, to improve system reliability, it duplicates a task if the task's hazard rate exceeds a threshold \(\theta \). A comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithm can shorten schedule length and improve system reliability significantly.
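
The abstract does not give the model's equations, so the following is only a minimal sketch of the idea, assuming the standard two-parameter Weibull distribution with reliability R(t) = exp(-(t/scale)^shape) and hazard rate h(t) = (shape/scale)(t/scale)^(shape-1). The function names, the parameters `shape` and `scale`, and the choice to evaluate the hazard rate at the task's finish time are all hypothetical and are not taken from the paper; only the rule "duplicate when the hazard rate exceeds θ" comes from the abstract.

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that a processor survives up to time t (standard Weibull)."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate at time t > 0 (standard Weibull)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def task_reliability(start, finish, shape, scale):
    """Probability that the processor survives [start, finish],
    given that it is still alive at `start` (conditional reliability)."""
    return (weibull_reliability(finish, shape, scale)
            / weibull_reliability(start, shape, scale))


def should_duplicate(finish, shape, scale, theta):
    """Assumed duplication rule: duplicate the task when the hazard rate
    at its finish time exceeds the threshold theta."""
    return weibull_hazard(finish, shape, scale) > theta
```

With `shape = 1` the hazard rate is the constant `1 / scale`, which reduces to the exponential failure model. With `shape > 1` the hazard rate grows over time, so under this sketch tasks scheduled later on an aging processor are more likely to cross the threshold and be duplicated.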


Metadata
Title
An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems
Authors
Xiaoyong Tang
Kenli Li
Guiping Liao
Publication date
01-12-2014
Publisher
Springer US
Published in
Cluster Computing / Issue 4/2014
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-014-0372-1
