Published in: Computing 9/2020

19 February 2020

Failure prediction of tasks in the cloud at an earlier stage: a solution based on domain information mining

Written by: Chunhong Liu, Liping Dai, Yi Lai, Guibing Lai, Wentao Mao

Abstract

In a large-scale data center, it is vital to recognize the termination statuses of applications precisely and at an early stage. In recent years, many machine learning techniques have been applied to this problem, since accurate early prediction helps optimize the scheduling policy and improve the efficiency of resource utilization. However, if an application's dynamic information is insufficient at the early stage, the generalization performance of the machine learning model degrades and the prediction accuracy can be low. To overcome this problem, this paper proposes a novel failure prediction method, based on the association relationships between similar jobs, that jointly predicts tasks' termination statuses at an earlier stage. Among similar jobs, whose tasks show similar patterns of change in consumed resources, an inherent structural correlation may exist, and this correlation information is significant for improving the prediction model's generalization performance. First, a job clustering algorithm is proposed for identifying highly similar jobs among jobs that have varying numbers of tasks. Second, based on the job clustering results, a robust multi-task learning algorithm is introduced to exploit the domain information among jobs (i.e., the interactional relationships among jobs with respect to the termination statuses of their tasks). Experiments are conducted on a Google cluster workload trace dataset. The results show that, at 1/3 of the tasks' running time, the proposed method achieves higher prediction accuracy, a lower misjudgment rate, and higher predictive stability than several state-of-the-art methods.
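The two steps named in the abstract (cluster similar jobs, then learn their task predictors jointly) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the job summary (mean and standard deviation of per-task resource usage), plain k-means in place of the paper's clustering algorithm, a squared loss on ±1 termination labels, and a low-rank plus group-sparse decomposition as one common formulation of robust multi-task learning. The function names `job_signature`, `cluster_jobs`, `objective`, and `robust_mtl` are hypothetical.

```python
import numpy as np

def job_signature(task_usage):
    """Map a job with any number of tasks to one fixed-length vector.

    task_usage has shape (n_tasks, n_features). Using the mean and standard
    deviation over tasks is an assumed summary, not the paper's definition.
    """
    u = np.asarray(task_usage, dtype=float)
    return np.concatenate([u.mean(axis=0), u.std(axis=0)])

def cluster_jobs(signatures, k, n_iter=20):
    """Plain k-means (a stand-in) with deterministic farthest-point init."""
    pts = np.asarray(signatures, dtype=float)
    centers = [pts[0]]
    while len(centers) < k:
        d = np.min([np.linalg.norm(pts - c, axis=1) for c in centers], axis=0)
        centers.append(pts[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = pts[labels == c].mean(axis=0)
    return labels

def objective(Xs, ys, L, S, alpha, beta):
    """Squared loss + alpha * trace norm of L + beta * column group norm of S."""
    W = L + S
    loss = sum(np.sum((X @ W[:, t] - y) ** 2) / (2 * len(y))
               for t, (X, y) in enumerate(zip(Xs, ys)))
    trace_norm = np.linalg.svd(L, compute_uv=False).sum()
    group_norm = np.linalg.norm(S, axis=0).sum()
    return loss + alpha * trace_norm + beta * group_norm

def robust_mtl(Xs, ys, alpha=0.1, beta=0.1, n_iter=200):
    """Proximal gradient on W = L + S, one weight column per job.

    L (low rank) carries structure shared by the jobs of a cluster;
    S (few non-zero columns) absorbs jobs that do not fit that structure.
    """
    d, T = Xs[0].shape[1], len(Xs)
    L, S = np.zeros((d, T)), np.zeros((d, T))
    # Lipschitz constant of the smooth part with respect to (L, S) jointly.
    lip = 2 * max(np.linalg.norm(X, 2) ** 2 / len(X) for X in Xs)
    step = 1.0 / lip
    for _ in range(n_iter):
        W = L + S
        G = np.column_stack([X.T @ (X @ W[:, t] - y) / len(y)
                             for t, (X, y) in enumerate(zip(Xs, ys))])
        # Prox of the trace norm: soft-threshold the singular values.
        U, sv, Vt = np.linalg.svd(L - step * G, full_matrices=False)
        L = (U * np.maximum(sv - step * alpha, 0.0)) @ Vt
        # Prox of the column-wise group norm: shrink each column as a whole.
        Z = S - step * G
        norms = np.maximum(np.linalg.norm(Z, axis=0), 1e-12)
        S = Z * np.maximum(1.0 - step * beta / norms, 0.0)
    return L, S
```

In this sketch, one would call `cluster_jobs` on the stacked job signatures, then run `robust_mtl` once per cluster, passing one feature matrix and one ±1 label vector per job. The status of a new task in job `t` would then be predicted as `np.sign(x @ (L + S)[:, t])`. Features would be drawn only from the early part of each task's run, in line with the early-stage setting described above.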


Metadata
Title
Failure prediction of tasks in the cloud at an earlier stage: a solution based on domain information mining
Written by
Chunhong Liu
Liping Dai
Yi Lai
Guibing Lai
Wentao Mao
Publication date
19 February 2020
Publisher
Springer Vienna
Published in
Computing / Issue 9/2020
Print ISSN: 0010-485X
Electronic ISSN: 1436-5057
DOI
https://doi.org/10.1007/s00607-020-00800-1
