2018 | Original Paper | Book Chapter

Fault Injection and Detection for Artificial Intelligence Applications in Container-Based Clouds

Written by: Kejiang Ye, Yangyang Liu, Guoyao Xu, Cheng-Zhong Xu

Published in: Cloud Computing – CLOUD 2018

Publisher: Springer International Publishing

Abstract

Container techniques are increasingly used to build modern cloud computing systems, achieving higher efficiency and lower resource costs than traditional virtual machine techniques. Artificial intelligence (AI) is a mainstream method for dealing with big data and is used in many areas to achieve better effectiveness. Attacks are known to happen every day in production cloud systems; however, the fault behaviors and interference of up-to-date AI applications in container-based cloud systems are still not clear. This paper studies the reliability of container-based clouds. We first propose a fault injection framework for container-based cloud systems. We build a Docker container environment with the TensorFlow deep learning framework installed and develop four typical attack programs: a CPU attack, a memory attack, a disk attack, and a DDoS attack. We then inject the attack programs into containers running AI applications (CNN, RNN, BRNN, and DRNN) to observe fault behaviors and interference phenomena. After that, we design fault detection models based on the quantile regression method to detect potential faults in containers. Experimental results show that the proposed fault detection models can effectively detect the injected faults with more than 60% precision, more than 90% recall, and nearly 100% accuracy.
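The detection step and the three reported metrics can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names quantile regression, whereas this sketch substitutes a much simpler fixed empirical-quantile threshold fitted on fault-free samples. All function names, the 0.95 quantile level, and the CPU utilisation values are hypothetical and chosen only for illustration.

```python
import math


def empirical_quantile(samples, q):
    """Nearest-rank q-quantile (0 < q <= 1) of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def detect(samples, threshold):
    """Flag every sample that exceeds the fault-free threshold."""
    return [x > threshold for x in samples]


def scores_from_counts(tp, fp, fn, tn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


def scores(predicted, actual):
    """Compare predicted fault flags against the injected-fault labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    fn = sum(not p and a for p, a in pairs)
    tn = sum(not p and not a for p, a in pairs)
    return scores_from_counts(tp, fp, fn, tn)


# Hypothetical CPU utilisation (%) sampled while no attack is running.
baseline = [20 + (i % 10) for i in range(100)]
threshold = empirical_quantile(baseline, 0.95)

# Hypothetical monitoring window; the samples 95 and 99 stand for injected faults.
observed = [25, 28, 95, 27, 99, 31, 22]
injected = [False, False, True, False, True, False, False]

flags = detect(observed, threshold)
precision, recall, accuracy = scores(flags, injected)
print(threshold, flags, precision, recall, accuracy)

# Hypothetical counts for a window in which faults are rare.
print(scores_from_counts(tp=9, fp=6, fn=1, tn=984))
```

The last call shows how figures of the kind quoted in the abstract can coexist. With these made-up counts, precision is 9/15 = 60%, recall is 9/10 = 90%, and accuracy is 993/1000 = 99.3%. When faulty samples are rare, the many correctly classified normal samples dominate accuracy, so a handful of false alarms lowers precision far more than it lowers accuracy.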

Metadata
Title
Fault Injection and Detection for Artificial Intelligence Applications in Container-Based Clouds
Written by
Kejiang Ye
Yangyang Liu
Guoyao Xu
Cheng-Zhong Xu
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-94295-7_8