Published in: Cluster Computing 4/2016

1 December 2016

SPSRG: a prediction approach for correlated failures in distributed computing systems

Authors: Weiwei Zheng, Zhili Wang, Haoqiu Huang, Luoming Meng, Xuesong Qiu


Abstract

Failure instances in distributed computing systems (DCSs) have exhibited temporal and spatial correlations, where a single failure instance can trigger a set of failure instances simultaneously or successively within a short time interval. In this work, we propose a correlated failure prediction approach (CFPA) to predict correlated failures of computing elements in DCSs. The approach models correlated-failure patterns using the concept of probabilistic shared risk groups and makes a prediction for correlated failures by exploiting an association rule mining approach in a parallel way. We conduct extensive experiments to evaluate the feasibility and effectiveness of CFPA using both failure traces from Los Alamos National Lab and simulated datasets. The experimental results show that the proposed approach outperforms other approaches in both the failure prediction performance and the execution time, and can potentially provide better prediction performance in a larger system.
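The idea of mining association rules from co-occurring failures can be illustrated with a minimal sketch. This is not the authors' CFPA implementation: it uses plain pairwise rules instead of probabilistic shared risk groups, runs sequentially rather than in parallel, and every name in it (`build_windows`, `mine_pair_rules`, `predict`, the node labels, the 10-second window, and the thresholds) is a hypothetical choice made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_windows(events, window):
    """Group (timestamp, node) failures into sets of nodes that fail
    within `window` time units of the first failure in each group."""
    windows = []
    current, start = set(), None
    for ts, node in sorted(events):
        if start is None or ts - start > window:
            if current:
                windows.append(current)
            current, start = set(), ts
        current.add(node)
    if current:
        windows.append(current)
    return windows


def mine_pair_rules(windows, min_support, min_confidence):
    """Return {(a, b): confidence} for rules 'a fails -> b also fails'."""
    single = Counter()  # number of windows in which each node fails
    pair = Counter()    # number of windows in which both nodes fail
    for w in windows:
        single.update(w)
        pair.update(combinations(sorted(w), 2))
    n = len(windows)
    rules = {}
    for (a, b), count in pair.items():
        if count / n < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / single[x]
            if confidence >= min_confidence:
                rules[(x, y)] = confidence
    return rules


def predict(rules, failed_node):
    """List the nodes that the mined rules expect to fail with `failed_node`."""
    return sorted(y for (x, y) in rules if x == failed_node)


# Hypothetical failure trace: (timestamp in seconds, node id).
events = [(0, "n1"), (5, "n2"), (100, "n1"), (103, "n2"),
          (200, "n1"), (300, "n3"), (302, "n2")]
windows = build_windows(events, window=10)
rules = mine_pair_rules(windows, min_support=0.5, min_confidence=0.6)
print(predict(rules, "n1"))
```

In this toy trace, `n1` and `n2` fail together in two of the four windows, so the only rules that pass both thresholds link those two nodes. The counting loop in `mine_pair_rules` is the part that a parallel implementation would likely split across workers, although how the paper partitions the work is not stated in the abstract.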

Footnotes
1. The operator “\(\backslash\)” denotes set minus, as in “\(X \backslash Y\)”, which means that Y is excluded from X.
2. Each \(I_{k}\) corresponds to a certain CE in a DCS.
Metadata
Publisher: Springer US
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI: https://doi.org/10.1007/s10586-016-0633-2
