3.1. Deep Auto-Encoder
The auto-encoder (AE) [37] consists of an encoder and a decoder. The essence of the encoder and decoder is the hidden layer in the neural network. The parameters of the AE are optimized based on the backpropagation algorithm and the gradient descent algorithm.
The process from the input layer to the hidden layer is called the encoder. The task of the encoder is to convert the given input data $x$ to a lower-dimensional representation $h$. The conversion process is represented by Equation (1):

$$h = f(Wx + b) \quad (1)$$

where $W$ is the weight parameter, $b$ is the offset value, $f$ is the activation function, and $h$ is the hidden vector obtained by the encoder.
The decoder is the inverse process of the encoder. The hidden vector is restored to the input vector through the inverse operation by Equation (2):

$$\hat{x} = f(W'h + c) \quad (2)$$

where $W'$ is the weight parameter of each layer of the decoder, $c$ is the offset value of each layer of the decoder, and $h$ is the input from the encoder to the decoder.
The encoder neural network reaches the minimum of the loss function through the gradient descent algorithm and thereby obtains the optimal parameters $W$, $b$, and $c$. The loss minimized by gradient descent is represented by Equation (3):

$$L = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \quad (3)$$

where $y_i$ is the actual tag value, $\hat{y}_i$ is the output tag value, and $n$ is the number of samples.
A single auto-encoder can learn only limited feature variation through a three-layer network (input layer → hidden layer → output layer). For classification tasks involving deep features, the shallow features obtained by such a structure often increase the computational effort of the subsequent classification task. A deep auto-encoder can instead be used to extract data features and learn multiple representations of the original data layer by layer. The structure of the deep auto-encoder is shown in Figure 2. Each layer builds on the representation of the layer below it, so the extracted features are more abstract and better suited to complex classification tasks.
The expression $x$ represents the input data, and the initial feature representation $h_1$ is obtained after processing by the first hidden layer. The idea of the deep auto-encoder is to increase the number of hidden layers of the encoder and decoder so that subsequent hidden layers continue to extract more abstract features from $h_1$. After training is completed, the output of the encoder is a good representation of the original data.
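As a concrete illustration, the layer-wise encode/decode computation of Equations (1) and (2) can be sketched as follows; the layer sizes (77 → 32 → 8) and the random weights are placeholders, not the values used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Encoder stack 77 -> 32 -> 8, applying Equation (1) per layer: h = f(Wx + b)
enc_dims = [77, 32, 8]
enc = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
       for n, m in zip(enc_dims[:-1], enc_dims[1:])]

# Decoder mirrors the encoder, applying Equation (2): x_hat = f(W'h + c)
dec_dims = [8, 32, 77]
dec = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
       for n, m in zip(dec_dims[:-1], dec_dims[1:])]

def forward(v, layers):
    h = v
    for W, b in layers:
        h = sigmoid(W @ h + b)   # one hidden layer per (W, b) pair
    return h

x = rng.random(77)          # one 77-dimensional sample
h = forward(x, enc)         # low-dimensional code
x_hat = forward(h, dec)     # reconstruction of the input
```

Training would then adjust all weights by backpropagation to minimize the reconstruction loss between `x` and `x_hat`, as in Equation (3).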
3.2. Convolutional Neural Networks
CNN [38] is a typical deep learning algorithm. It is a feedforward neural network with a deep structure that includes convolution calculations. CNN is mainly composed of three layers: an input layer, a hidden layer, and an output layer. The hidden layer is the key component of CNN. It includes a convolutional layer, a pooling layer, and a fully connected layer.
Convolutional layer: This layer is used to extract features from the input data. It has a few convolution kernels. Each element of the convolution kernel corresponds to a weight and a bias. In a convolutional layer, the feature vector of the previous layer is convolved with the convolution kernel through the activation function to obtain the output vector. The convolution process can enhance the feature of the original data and reduce the effect of noise;
$x^l$ is the output of the $l$th layer, and $x_j^l$ is the $j$th feature vector of the $l$th convolutional layer. The value of $x_j^l$ can be calculated by Equation (4):

$$x_j^l = f\left(\sum_{i \in M_j} x_i^{l-1} \otimes k_{ij}^l + b_j^l\right) \quad (4)$$

where $M_j$ is the selection set of input feature vectors, $k_{ij}^l$ is the $j$th convolution kernel parameter of the input feature $i$, $\otimes$ is the convolution operation, $b_j^l$ is the additive offset, and the Rectified Linear Unit (ReLU) is used as the activation function $f$. ReLU can overcome the vanishing gradient problem.
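A minimal sketch of the single-channel case of Equation (4): a valid 1-D convolution of a feature vector with a kernel, an additive offset, and ReLU as the activation. The input and kernel values are illustrative only.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def conv1d_layer(x, k, b):
    """Valid 1-D convolution followed by ReLU, the single-channel form of Equation (4)."""
    s = len(k)
    out = np.array([np.dot(x[i:i + s], k) for i in range(len(x) - s + 1)])
    return relu(out + b)

x = np.array([1.0, -2.0, 3.0, 0.5, -1.0])   # illustrative feature vector
k = np.array([1.0, 0.0, -1.0])              # illustrative kernel
y = conv1d_layer(x, k, b=0.0)               # -> [0.0, 0.0, 4.0]
```

With several kernels, each produces its own output feature vector $x_j^l$; summing over a selection set $M_j$ of input features gives the full multi-channel form.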
Pooling layer: The pooling layer usually follows the convolutional layer. It is used for a second feature extraction operation. It can effectively avoid the overfitting problem and strengthen the robustness of the network. The pooling layer performs statistical calculations on the output features from the convolutional layer to obtain statistical probability features instead of the original features. The pooling layer is responsible for downsampling the input vector, as shown in Equation (5):

$$x_j^l = f\left(\beta_j^l\, \mathrm{down}(x_j^{l-1}) + b_j^l\right) \quad (5)$$

where the function $\mathrm{down}(\cdot)$ is responsible for downsampling the $j$th vector in the $(l-1)$th layer; $\beta_j^l$ and $b_j^l$ are the multiplicative bias and additive bias, respectively.
Fully connected layer: The fully connected layer in the convolutional neural network is similar to the hidden layer in the traditional feedforward neural network. The fully connected layer is located in the last part of the hidden layer of the convolutional neural network and only transmits signals to other fully connected layers. The feature map is expanded into a vector and loses its spatial topology in the fully connected layer. Each neuron of the fully connected layer is connected one by one to the neurons of the feature vector of the previous layer, and the output of each neuron is expressed by Equation (6):

$$y = f(w^T x + b) \quad (6)$$

where $x$ is the input of the neuron, $y$ is the output of the neuron, $w$ and $b$ are the corresponding weight and offset parameters, $w^T$ is the transpose of the parameter matrix, and $b$ is the offset.
Output layer: The output layer follows the fully connected layer. The output layer of the convolutional neural network uses the softmax regression function, similar to the output layer of the traditional fully connected neural network. For a given test input $x$, the probability value $p(y=i \mid x)$ that it belongs to the $i$th class is calculated. The function outputs a $k$-dimensional vector representing $k$ estimated probabilities. The system equation is expressed as Equation (7):

$$p(y=i \mid x) = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}} \quad (7)$$

where $p(y=i \mid x)$ represents the probability that softmax judges the sample as class $i$, the base $e$ is introduced to facilitate the subsequent backpropagation derivative calculation, $\theta$ represents the layer parameters, $k$ represents the total number of classes, and $j$ indexes the classes. The output value of the function represents the final probability that the sample belongs to class $i$.
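Equation (7) can be checked with a small sketch; the parameter matrix `theta` below is a placeholder, with one row per class.

```python
import numpy as np

def softmax_probs(theta, x):
    """p(y=i|x) = exp(theta_i^T x) / sum_j exp(theta_j^T x), as in Equation (7)."""
    logits = theta @ x
    logits -= logits.max()      # subtracting the max improves numerical stability
    e = np.exp(logits)
    return e / e.sum()

theta = np.array([[1.0, 0.0],   # k = 3 classes, 2-dimensional input (illustrative)
                  [0.0, 1.0],
                  [0.5, 0.5]])
x = np.array([2.0, 1.0])
p = softmax_probs(theta, x)     # probabilities sum to 1; class 0 scores highest
```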
3.3. Attention Mechanism
The attention mechanism [39] is modeled on a characteristic of human vision: humans quickly scan an image as a whole to locate the regions that require focused attention, then devote more attention resources to those regions while suppressing other, unimportant information.
The attention mechanism in deep learning is essentially similar to the human selective visual attention mechanism, and the core goal is also to select the information that is more critical to the current task goal from the many pieces of information. The essence is to obtain a new representation by linear weighting based on the relationships between things.
For example, suppose a text needs to be scored. Each text has a corresponding vector representation, and there is a retrieval library $\{(k_i, v_i)\}_{i=1}^{m}$, where $(k_i, v_i)$ is a pair consisting of a vector representation and a rating. Given a text $q$ to be rated, the traditional method of calculating the rating requires computing the similarity between the query text $q$ and each text $k_i$ in the retrieval library; then, the ratings are weighted by the similarities to obtain the prediction score $\hat{v}$:

$$\hat{v} = \sum_{i=1}^{m} \mathrm{sim}(q, k_i)\, v_i$$
Correspondingly, attention is given a vector representation $q$ of a query and the corresponding retrieval library $\{(k_i, v_i)\}_{i=1}^{m}$, where $v_i$ is the vector representation corresponding to keyword $k_i$. Then, as shown in Figure 3, in order to utilize the knowledge inside $q$ and the retrieval library, the representation of $q$ is transformed as follows, where $q'$ is the converted value and the weights $\alpha_i$ are derived from the similarity between $q$ and each $k_i$:

$$q' = \sum_{i=1}^{m} \alpha_i v_i$$
Depending on the method of similarity calculation, there are different kinds of attention, including inner product similarity, cosine similarity, and splicing similarity, where $q$ and $k$ have the same meaning as above and $W^T$ is the transpose of the parameter matrix.

Inner product similarity:

$$\mathrm{sim}(q, k_i) = q^T k_i$$

Finally, a set of weights $\alpha_i$ is obtained based on the similarity, and these weights are normalized:

$$\alpha_i = \frac{\mathrm{sim}(q, k_i)}{\sum_{j=1}^{m} \mathrm{sim}(q, k_j)}$$
Attention thus calculates a set of weights from the relationships between things and then forms a weighted combination to obtain a new representation; it can be understood as a method of feature transformation.
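The retrieval view of attention described above can be sketched as follows: inner-product similarities between a query and the keys are normalized with softmax and used to weight the values. The query, key, and value vectors are illustrative.

```python
import numpy as np

def attention(q, K, V):
    """New representation: softmax-normalized inner-product similarities weight the values."""
    sim = K @ q                              # inner product similarity per key
    sim -= sim.max()                         # numerical stability
    w = np.exp(sim) / np.exp(sim).sum()      # normalized attention weights
    return w @ V, w                          # weighted representation and the weights

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0],                    # aligned with q -> larger weight
              [0.0, 1.0]])                   # orthogonal to q -> smaller weight
V = np.array([[10.0, 0.0],
              [0.0, 10.0]])
out, w = attention(q, K, V)                  # out leans toward the first value vector
```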
One of the most commonly used attention mechanisms is feedforward attention, in which the query is set to be a learnable parameter $q$, and the key and value inside the retrieval library are set to be the same. The attention weights are then computed using one layer of a neural network. As shown in the following equation, $w_i$ is the attention weight, $k_i$ is the keyword vector, and $k$ is the number of keywords:

$$w_i = \frac{\exp(q^T k_i)}{\sum_{j=1}^{k} \exp(q^T k_j)}$$
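A sketch of feedforward attention as described above: the query is a learnable vector, keys and values coincide, and a single layer computes the weights. The dimensions and the random initial values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)

K = rng.standard_normal((5, 4))   # 5 keyword vectors; keys and values are the same
q = rng.standard_normal(4)        # learnable query parameter (placeholder init;
                                  # in training it is updated by backpropagation)

scores = K @ q                            # one linear layer scores each keyword
scores -= scores.max()                    # numerical stability
w = np.exp(scores) / np.exp(scores).sum() # attention weights, sum to 1
context = w @ K                           # weighted sum of the (key == value) vectors
```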
3.4. The DCA Model
Our model is composed of a deep auto-encoder, CNN, and attention, so we name the model the DCA model. The structure of the DCA model proposed in this paper is shown in
Figure 4.
In the data preprocessing phase, the first step is to clean the data. Because the dataset is generated from a simulated environment with limited flow-generating devices, Flow ID, Src IP, Src Port, Dst IP, Dst Port, and Timestamp attributes are relatively fixed and not universal. These six features are discarded and 78 features are retained.
For the binary classification experiments, the dataset labels are modified to 0 and 1, representing normal and abnormal traffic. For the multiclassification experiments, we encoded the eight different labels as three-bit binary codes, using 000, 001, 010, 011, 100, 101, 110, and 111 as the labels for the different traffic types.
After the numericalization of the features, the data are regularized by scaling each sample to unit norm using the $l_2$ norm, to avoid overfitting in the subsequent training process and to reduce the network error. For this dataset, the first 77 dimensions of a sample are data and the 78th dimension is the label. We input the first 77 dimensions, in the form $x = (x_1, x_2, \ldots, x_{77})$, into the model, with label $y$; the regularization process is as follows:

$$J_{reg} = J + \lambda \|w\|_2$$

where $\|w\|_2$ is:

$$\|w\|_2 = \sqrt{\sum_i w_i^2}$$

where $w$ is the vector of weight coefficients, $J$ is the cost function, $\lambda$ is the parameter controlling the degree of regularization, and $\|w\|_2$ is the $l_2$ norm of $w$. The product of the two is used as the penalty term of the cost function to penalize high-complexity models.
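Both operations described here can be sketched briefly: scaling one sample to unit $l_2$ norm, and adding an $l_2$ penalty term to a cost $J$. The sample values and the value of lambda are illustrative.

```python
import numpy as np

def l2_normalize(x):
    """Scale one sample to unit l2 norm."""
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

def regularized_cost(J, w, lam):
    """Penalized cost: J + lambda * ||w||_2."""
    return J + lam * np.linalg.norm(w)

x = np.array([3.0, 4.0])
x_scaled = l2_normalize(x)                      # -> [0.6, 0.8], unit norm

w = np.array([1.0, -2.0, 2.0])                  # ||w||_2 = 3
cost = regularized_cost(J=0.5, w=w, lam=0.1)    # 0.5 + 0.1 * 3 = 0.8
```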
For the preprocessed data, we assume that one of the samples is $x = (x_1, x_2, \ldots, x_{77})$. We input it into the deep auto-encoder layer of the model:

$$h_i = \sigma(w_i x + b_i), \quad i = 1, 2, \ldots, n$$

where $h_i$ is the $i$th hidden layer output, $n$ is the number of neurons in the hidden layer, $w_i$ is the weight matrix of the $i$th neuron in the hidden layer, and $\sigma$ is the sigmoid activation function, which is given by:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

After processing by the deep auto-encoder, we obtain the reduced-dimensional data matrix $A = (h_1, h_2, \ldots, h_n)$ with $n < 77$.
We feed the reduced-dimensional data matrix $A$ into the CNN. In the convolutional layer, we assume that the convolution kernel is a matrix $K$ of size $s \times s$, where $s$ is the convolution kernel size. We slide the convolution kernel over the matrix $A$:

$$B = K \otimes A$$

Here, $B$ is the output matrix after the convolution operation by the convolution kernel, and its size is determined by the sizes of $A$ and $K$. The ReLU activation function then processes the matrix $B$. The equation of the ReLU activation function is:

$$\mathrm{ReLU}(z) = \max(0, z)$$
After the max-pooling operation, the matrix is divided into several regions of size 2×1, and the maximum value in each region is kept, yielding a new matrix that is half the size of the original.
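The 2×1 max-pooling step can be sketched as follows; the input column is illustrative.

```python
import numpy as np

def maxpool_2x1(col):
    """Keep the maximum of each non-overlapping 2x1 region, halving the length."""
    return col.reshape(-1, 2).max(axis=1)

col = np.array([1.0, 3.0, -2.0, 0.5, 4.0, 4.0])
pooled = maxpool_2x1(col)     # -> [3.0, 0.5, 4.0]
```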
The role of convolution kernels is parameter sharing: every patch of input features of the same size as the kernel is computed with the same parameters, while different kernels have different parameters and extract different features. This prevents parameter explosion in deep convolutional neural networks and greatly reduces the amount of computation.
After the CNN processing, the data are input into the attention mechanism. Suppose the CNN output sequence is $(a_1, a_2, \ldots, a_m)$; the attention output is:

$$c_i = \sum_{j=1}^{m} \alpha_{ij}\, a_j$$

where $\alpha_{ij}$ denotes the attention allocation coefficient of the $j$th element of the original input at the $i$th output; the larger the coefficient, the higher the importance of the current information.
Used alone, the attention mechanism weights the elements within each local feature of the feature map to obtain a weight score, which causes it to ignore the correlations between local features and leaves strong information redundancy between features. In this model, the original features are first reduced in dimension by the deep auto-encoder to extract deep abstract features and then processed by the CNN layers to extract deeper features. After these two steps, feature redundancy and noise are greatly reduced. The attention mechanism then further weights and strengthens the important data features based on the CNN output. This processing flow lets the whole model grasp the key information in the features accurately, which is why it achieves high recognition accuracy.
Suppose the output vector of the attention mechanism is $x$. We input $x$ into the classifier and obtain the classification result. For the binary classification experiment, using the sigmoid function as the classifier, the probability that the sample is a normal flow is:

$$P(y=1 \mid x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$

The probability that the sample is an abnormal flow is:

$$P(y=0 \mid x) = 1 - P(y=1 \mid x)$$

The joint sample probability is:

$$P(y \mid x) = P(y=1 \mid x)^{y}\,\big(1 - P(y=1 \mid x)\big)^{1-y}$$

where $y \in \{0, 1\}$, $w^T$ is the transpose of the weight parameter matrix, and $x$ is the vector input to the classifier.
After obtaining the sample probabilities, they are fed into the loss function and back-propagated using the chain rule of derivatives to continuously update the weight parameter $w$ and the bias parameter $b$ of each layer in order to minimize the cost function. This experiment uses the binary cross-entropy function as the loss function for back-propagation:

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$$

where $N$ is the total number of samples, $y_i$ is the flow label, and $p_i$ is the probability that sample $i$ is a normal flow.
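The binary cross-entropy loss can be sketched as follows; the labels and predicted probabilities are illustrative.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    """L = -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    p = np.clip(p, eps, 1.0 - eps)   # clip to avoid log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])   # flow labels (1 = normal, 0 = abnormal)
p = np.array([0.9, 0.1, 0.8, 0.2])   # predicted normal-flow probabilities
loss = binary_cross_entropy(y, p)    # small loss: predictions match the labels
```

Confident wrong predictions drive the loss sharply upward, which is what makes the gradient signal useful for updating $w$ and $b$.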
For the multiclassification experiment, there are eight classifiers representing the eight different types of traffic, and each classifier outputs the probability that a sample belongs to that type of traffic. Using the softmax function as the classifier, the process is similar to binary classification: the eight classifiers output the probability that a sample belongs to each of the eight classes, and the class with the highest probability is the predicted classification. The probability for each class is derived using the following equation:
$$p_i = \frac{e^{w_i^T x}}{\sum_{j=1}^{k} e^{w_j^T x}}$$

where $k$ is the number of label types and $w_i$ is the weight parameter of the $i$th classifier. The cost function used for multiclassification is:

$$J = -\frac{1}{N}\sum_{n=1}^{N}\sum_{j=1}^{k} 1\{y_n = j\}\log p_{nj}$$

where $1\{\cdot\}$ is an operation defined to take the value 1 when $y_n = j$ and 0 when $y_n \neq j$; $N$ is the total number of samples and $k$ is the number of label types.
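The multiclass probability and cost function can be sketched together; the weight matrix, samples, and labels below are placeholders, with k = 8 classes as in the experiments.

```python
import numpy as np

rng = np.random.default_rng(2)
k, d = 8, 16                            # 8 traffic classes, illustrative feature size

W = rng.standard_normal((k, d)) * 0.1   # weight parameters, one row per class

def class_probs(x):
    """Softmax over the k class scores."""
    logits = W @ x
    logits -= logits.max()              # numerical stability
    e = np.exp(logits)
    return e / e.sum()

def multiclass_cost(X, y):
    """Indicator-form cost: only the log-probability of the true class contributes."""
    probs = np.array([class_probs(x) for x in X])
    return -np.mean(np.log(probs[np.arange(len(y)), y]))

X = rng.standard_normal((4, d))         # 4 illustrative samples
y = np.array([0, 3, 7, 2])              # their class labels
p = class_probs(X[0])                   # per-class probabilities for one sample
J = multiclass_cost(X, y)               # cost over the small batch
```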