Introduction
Background
Related work
The idea of solution
The process of container cascade fault detection
Construction of container cascade fault relation model
Finding frequently associated fault container instance sequences
Notation | Definition |
---|---|
D | Historical Sequence |
C | Association Sequence |
L | Association Set |
\({S}_{T}(i,j)\) | The temporal correlation strength of the cascade fault |
\({S}_{S}(i,j)\) | The spatial correlation strength of the cascade fault |
\({S}_{O}(i,j)\) | The spatial–temporal correlation strength of the cascade fault |
\({x}_{T}, {x}_{S}\) | Weights of the temporal and spatial correlation degrees |
\(S=\{{S}_{O}\left(1,1\right),\dots ,{S}_{O}(n,n)\}\) | Spatial–temporal correlation degree matrix |
\(M\) | base model |
\(\delta\) | integrated model threshold |
\(F(x)\) | Final integrated model |
\({x}_{i}\) | container historical performance data |
Calculation of the spatial–temporal correlation degree of container faults
Model learning method for imbalanced historical fault data
Container cascade fault detection based on the cascade fault relational model
- Data acquisition and input. Obtain the container's historical performance data (CPU, memory, I/O, etc.) for the previous n time steps. These data are combined with the container state (0 for normal, 1 for fault) and the spatial–temporal correlation degree of the container to form the input sequence \(\left\{{x}_{1},{x}_{2},\cdots ,{x}_{n}\right\}\), which is fed into the integrated model \(F(x)\) trained above, where \({x}_{i}=\{ID,PerformanceData,Associatedmatrix,Othercontainerstates,State\}\).
- Fault detection process. The integrated model \(F(x)\) predicts whether the container will fail at time (n + 1) from the input sequence \(\left\{{x}_{1},{x}_{2},\cdots ,{x}_{n}\right\}\). Once the data for time (n + 1) are available, the sliding window advances by one step, and the container state at time (n + 2) is predicted from the sequence \(\left\{{x}_{2},{x}_{3},\cdots ,{x}_{n+1}\right\}\). The detection window keeps sliding in this way, so the possibility of a container fault is assessed continuously:
$${output}_{i}=model.predict\left(\left\{{x}_{i},{x}_{i+1},\cdots ,{x}_{n+i-1}\right\}\right)$$ (6)
- Results output. The sigmoid activation function transforms the model output into the failure probability of the container, a value between 0 and 1. The larger the value, the more likely a failure:
$${predict}_{i}=sigmoid\left({output}_{i}\right)$$ (7)
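The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each sample \({x}_{i}\) is reduced to a single CPU utilisation value instead of the full record, and `toy_predict` is a hypothetical stand-in for the trained integrated model \(F(x)\). The names `sliding_window_detect` and `toy_predict` are illustrative and do not come from the paper.

```python
import math
from typing import Callable, List, Sequence


def sigmoid(z: float) -> float:
    """Map a raw model output to a value in (0, 1), as in Eq. (7)."""
    # Branch on the sign so that math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sliding_window_detect(
    samples: Sequence[float],
    n: int,
    predict: Callable[[Sequence[float]], float],
) -> List[float]:
    """Return one failure probability per window of n consecutive samples.

    Window i covers samples[i : i + n] and scores the step right after it,
    mirroring Eq. (6) followed by Eq. (7).
    """
    probabilities = []
    for i in range(len(samples) - n + 1):
        window = samples[i:i + n]
        probabilities.append(sigmoid(predict(window)))
    return probabilities


def toy_predict(window: Sequence[float]) -> float:
    """Hypothetical stand-in for F(x): a raw score that grows with mean utilisation."""
    return 10.0 * (sum(window) / len(window) - 0.5)


# Assumed example data: CPU utilisation of one container at six time steps.
cpu = [0.2, 0.3, 0.4, 0.6, 0.8, 0.9]
probs = sliding_window_detect(cpu, n=3, predict=toy_predict)
for i, p in enumerate(probs, start=1):
    print(f"window {i}: failure probability = {p:.3f}")
```

With six samples and n = 3 the loop produces four windows, and each new window drops the oldest sample and appends the newest one, which is the sliding behaviour described in the detection step.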
Experiments and evaluation
Set up
Node | Hardware configuration | Container instances |
---|---|---|
1 | 2.4 GHz Intel Xeon E5-260 CPU, 2 GB RAM | node exporter, Grafana, cAdvisor, ceph admin, hadoop master |
2 | 2.4 GHz Intel Xeon E5-260 CPU, 2 GB RAM | mysql, InfluxDB, web, web, ceph1, hadoop datanode1 |
3 | 2.4 GHz Intel Xeon E5-260 CPU, 2 GB RAM | python, code, ceph2 |
4 | 2.4 GHz Intel Xeon E5-260 CPU, 2 GB RAM | ceph3, hadoop datanode2 |
Type of failure | Description | Action |
---|---|---|
CPU failure | Simulate a container requesting CPU until the system reaches its limit and the container crashes | stress-ng generates CPU pressure in the specified container to occupy all allocated CPU |
Memory failure | Simulate a container requesting memory until the system reaches its limit and the container crashes | stress-ng simulates the creation of objects pointing to global static variables in the container, gradually filling up the container's memory |
Network failure | Simulate full occupation of the container's network bandwidth | iPerf continuously sends data packets to other container instances until the requested network bandwidth is occupied |
Disk failure | Simulate full occupation of the container's disk I/O | stress-ng simulates the disk write() function, continuously writing to the disk and occupying the container's disk I/O |
Evaluation of fault relational model
Performance comparison of cascade fault relational models
Model | Number of fault propagation paths | Construction time (s) |
---|---|---|
Apriori | 376 | 13,608 |
LCS | 1491 | 10,233 |
CFD-STC | 376 | 13,834 |
Comparison of fault detection effects
Metrics | LCS + LSTM | Apriori + LSTM | CFD-STC + LSTM |
---|---|---|---|
RMSE | 0.549 | 0.414 | 0.287 |
MAPE | 0.607 | 0.511 | 0.315 |
\({R}^{2}\) | 0.622 | 0.761 | 0.908 |
Precision | 0.689 | 0.736 | 0.823 |
Recall | 0.431 | 0.707 | 0.808 |
\({F}_{1}\) | 0.530 | 0.722 | 0.815 |
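As a reading aid for the table above, the short sketch below recomputes \({F}_{1}\) from the reported precision and recall, using the standard definition of \({F}_{1}\) as their harmonic mean. The helper name `f1_score` is illustrative. The paper does not state how its values were rounded, so small differences in the third decimal place are expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the comparison table.
reported = {
    "LCS + LSTM": (0.689, 0.431, 0.530),
    "Apriori + LSTM": (0.736, 0.707, 0.722),
    "CFD-STC + LSTM": (0.823, 0.808, 0.815),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported F1 = {f1:.3f}")
```

The recomputed values agree with the reported ones to within rounding of the three-decimal inputs, which confirms that the precision, recall, and \({F}_{1}\) rows are mutually consistent.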
Effect evaluation of ensemble learning optimization method
Comparison of different models
Metrics | CFD-STC | CFD-STC-E |
---|---|---|
RMSE | 0.287 | 0.213 |
MAPE | 0.315 | 0.232 |
\({R}^{2}\) | 0.908 | 0.950 |
Precision | 0.823 | 0.913 |
Recall | 0.808 | 0.895 |
\({F}_{1}\) | 0.815 | 0.904 |
Metrics | CFD-STC | CFD-STC-E |
---|---|---|
RMSE | 0.281 | 0.199 |
MAPE | 0.302 | 0.203 |
\({R}^{2}\) | 0.917 | 0.961 |
Precision | 0.863 | 0.920 |
Recall | 0.824 | 0.903 |
\({F}_{1}\) | 0.843 | 0.911 |