Cluster Computing

Published in:

01-12-2014

An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems

Authors: Xiaoyong Tang, Kenli Li, Guiping Liao

Published in: Cluster Computing | Issue 4/2014

Abstract

In large-scale heterogeneous cluster computing systems, processor and network failures are inevitable and can adversely affect the applications executing on such systems. One way of taking failures into account is to employ a reliable scheduling algorithm. However, most existing scheduling algorithms for precedence-constrained tasks in heterogeneous systems consider only schedule length and do not efficiently satisfy the reliability requirements of tasks. In recognition of this problem, we build an application reliability analysis model based on the Weibull distribution, which can dynamically measure the reliability of tasks executing on heterogeneous clusters with arbitrary network architectures. We then propose a reliability-driven earliest finish time with duplication scheduling algorithm (REFTD), which incorporates task reliability overhead into scheduling. Furthermore, to improve system reliability, it duplicates a task if the task's hazard rate exceeds a threshold \(\theta \). A comparison study, based on both randomly generated graphs and the graphs of some real applications, shows that our scheduling algorithm can shorten schedule length and improve system reliability significantly.
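
As a rough illustration of the quantities the abstract mentions, the sketch below uses the textbook two-parameter Weibull distribution, with reliability R(t) = exp(-(t/scale)^shape) and hazard rate h(t) = (shape/scale) * (t/scale)^(shape-1), together with a simple threshold test. The abstract does not give the paper's actual reliability model or the REFTD duplication rule, so the function names, the parameters, and the rule "duplicate when h(t) > theta" as coded here are assumptions for illustration only.

```python
import math


def weibull_reliability(t, shape, scale):
    """Textbook Weibull reliability R(t) = exp(-(t/scale)^shape), for t >= 0."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """Textbook Weibull hazard rate h(t) = (shape/scale) * (t/scale)^(shape-1).

    Assumes t > 0; for shape < 1 the hazard is undefined at t = 0.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def should_duplicate(t, shape, scale, theta):
    """Hypothetical rule: duplicate a task when its hazard rate exceeds theta.

    This mirrors the threshold idea described in the abstract; it is not
    taken from the REFTD algorithm itself.
    """
    return weibull_hazard(t, shape, scale) > theta


if __name__ == "__main__":
    # Illustrative parameters only (not from the paper).
    shape, scale, theta = 2.0, 10.0, 0.3
    for t in (5.0, 20.0):
        print(t, weibull_reliability(t, shape, scale),
              weibull_hazard(t, shape, scale),
              should_duplicate(t, shape, scale, theta))
```

With a shape parameter above 1 the hazard rate grows over time, so under this assumed rule a task on a processor that has been running longer is more likely to be duplicated; with shape equal to 1 the model reduces to the constant-hazard exponential case.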

Title: An effective reliability-driven technique of allocating tasks on heterogeneous cluster systems
Authors: Xiaoyong Tang, Kenli Li, Guiping Liao
Publication date: 01-12-2014
Publisher: Springer US
Published in: Cluster Computing / Issue 4/2014
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI: https://doi.org/10.1007/s10586-014-0372-1
