1. Introduction
Video captioning is one of the notable research topics at the intersection of computer vision and natural language processing. A captioning model understands a video and generates captions describing it from visual data such as frame representations, motion data, and objects. The caption therefore conveys the content of the video or what changes within it. Recently, the encoder–decoder architecture has proven helpful for video captioning. In this architecture, the front part of the encoder extracts features with weight-frozen pre-trained feature extraction models and processes those features to find the decisive points of the video information. Such methods use not only a single kind of feature, such as an appearance feature, but several kinds of features to capture more information from videos, and they process these features in various ways.
Several papers [1,2,3,4] present various captioning methods. Such video captioning pipelines typically require a feature extraction step that converts raw pixel data into the vector form required by the rest of the deep-learning process. Moreover, pre-trained CNNs have been required for each feature extraction step. For example, in ORG-TRL [5], the appearance feature that represents frame information is extracted by 2D CNNs, 3D CNNs extract the motion feature, and an object-detection network such as Faster-RCNN extracts the object feature from the video.
Because pre-trained CNNs are used to convert a video into features, the captioning performance is directly affected by the performance of the feature extraction network. As can be seen from the experimental results of MGRMP [6], when the network that extracts motion features was changed from C3D to 3D-ResNeXt, performance improved considerably even though the captioning architecture was unchanged. This shows that good feature extraction strongly influences captioning performance. Additionally, E2E Video Captioning [7] proposed a method to optimize the feature extraction network via end-to-end learning.
However, the traditional feature extraction models have some limitations. While a weight-frozen network is efficient for feature extraction, it has the disadvantage that it is not updated while the entire model is trained on new data. In addition, because such models are based on CNNs, which have a local receptive field, their performance is bounded. In contrast, the transformer has a global receptive field thanks to its self-attention layers, which improves model performance when it is pre-trained well. In many fields [8,9,10], transformer networks outperform CNNs. It is therefore natural to replace CNN-based feature extraction models with transformer-based ones. Inspired by recent studies that apply transformer networks to vision tasks, we propose a full transformer architecture for video captioning, in which everything from the appearance feature extraction to caption generation is handled by transformers. We build the model on the Vision Transformer (ViT) [11] and adopt end-to-end learning. Moreover, the feature extraction gate (FEG) is proposed to acquire a much better understanding of visual features. The FEG combines the CLS token information with the previously discarded patch sequence information to extract features that better capture the visual content.
Furthermore, we use all encoder layer outputs to compensate for the lack of information caused by using a single type of feature. As in the M2 transformer [12], each encoder layer output enters each layer of the decoder as input. In [13], the authors observed that each encoder layer output carries slightly different information about the relationships between features. Therefore, because each encoder layer output expresses a different relation among the frame features, we expect an effect similar to using multiple features when using multiple encoder layer outputs. At this point, we add an additional self-attention to check how the encoder layer outputs are related and to further strengthen the video information. For this self-attention to be meaningful, each encoder layer output must pass through the same network. Thus, the model is designed based on the universal transformer, a layer weight-sharing structure. We name this method the universal encoder layer attention (UEA), and we name our model the universal attention transformer (UAT).
Our contributions are the following: (1) we propose a full transformer video captioning structure optimized via end-to-end learning; (2) we design the feature extraction gate (FEG), which produces better features by fusing the CLS token and the patch sequence; (3) we propose universal encoder layer attention (UEA), constructed to obtain more information from a single feature type.
3. Materials and Methods
Figure 1 shows the overall architecture. It is composed of two models: the feature extraction model and the captioning model. The appearance feature is extracted by the vision transformer. Our full captioning approach consists of two main components. The first is the feature extraction gate (FEG), which selects a better feature from the ViT output. The second is the encoder channel attention in the captioning model. When the model is run, the appearance feature is extracted by the ViT; it then passes through the FEG and arrives at the captioning encoder. The captioning encoder is in charge of processing the frame-feature sequence and searching for temporal relations within it. The captioning decoder then reads the encoder layer outputs and generates the captions. In this process, the relationships within the video content and the interactions between the video content and words are modeled through the scaled dot-product attention [29] that exists in both the encoder and the decoder.
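To make the data flow concrete, the following is a minimal sketch of this forward pass in PyTorch-style Python. The module names (vit, feg, cap_encoder, cap_decoder) and the tensor shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def caption_video(frames, vit, feg, cap_encoder, cap_decoder, tgt_tokens):
    """frames: (T, 3, H, W) keyframes sampled from one video (hypothetical shapes)."""
    # 1) Appearance features per frame from the ViT backbone.
    cls_tok, patch_seq = vit(frames)                 # (T, D), (T, N, D)
    # 2) The feature extraction gate fuses the CLS token and the patch sequence.
    frame_feats = feg(cls_tok, patch_seq)            # (T, D)
    # 3) The captioning encoder models temporal relations; all layer outputs are kept.
    enc_layer_outputs = cap_encoder(frame_feats)     # (L, T, D)
    # 4) The decoder attends over the stacked encoder outputs and emits word logits.
    return cap_decoder(tgt_tokens, enc_layer_outputs)  # (W, vocab_size)
```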
Attention is an operation that performs a weighted sum with a value vector by scoring the similarity of the query and key distribution. Since our model consists of a full transformer structure, attention is performed everywhere. The scaled-dot product attention operation can be defined as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

where Q is a matrix consisting of query vectors, and K and V are the matrices consisting of keys and values, respectively. Q, K, and V all have the same dimension, and d is a scaling factor.
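As a reference point, a minimal PyTorch implementation of this operation could look as follows; batch and head dimensions are handled implicitly by broadcasting, and this is an illustrative sketch rather than the paper's code.

```python
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: tensors of shape (..., seq_len, d). Returns the attended values."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # query-key similarity, scaled by sqrt(d)
    weights = torch.softmax(scores, dim=-1)       # attention distribution over the keys
    return weights @ v                            # weighted sum of the value vectors
```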
Additionally, there is multi-head attention (MHA), which computes h new representations of the input, each in the context of the whole sequence. The idea of MHA is to acquire h context-reflecting representations and to use, as the attention output, the matrix obtained by concatenating them. It is formulated as:

$$\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_{1}, \ldots, \mathrm{head}_{h})\,W^{O}, \qquad \mathrm{head}_{i} = \mathrm{Attention}(QW_{i}^{Q}, KW_{i}^{K}, VW_{i}^{V}),$$

where $W^{O}$ and the per-head projections $W_{i}^{Q}, W_{i}^{K}, W_{i}^{V}$ are trainable matrices and h is the number of heads. Our layer normalization is performed before the MHA operates; therefore, the inputs of the MHA and F are the normalized input features I.
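A compact multi-head attention module in the same spirit is sketched below; it reuses the scaled_dot_product_attention function from the previous sketch, and the per-input linear projections and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Illustrative MHA: h parallel heads, concatenated and projected by a trainable W_O."""
    def __init__(self, d_model, h):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_head = h, d_model // h
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # the trainable output matrix W_O

    def forward(self, q, k, v):
        b = q.size(0)
        def split(x):                            # (B, S, d_model) -> (B, h, S, d_head)
            return x.view(b, -1, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        out = scaled_dot_product_attention(q, k, v)                       # attention per head
        out = out.transpose(1, 2).reshape(b, -1, self.h * self.d_head)    # concatenate heads
        return self.w_o(out)
```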
3.1. Feature Extraction Model
ViT Feature Extraction Process. Firstly, we extract T frames from the video. Then, each frame is passed into the transformer encoder for feature extraction. To feed the given pixel data to the ViT, which takes sequence data as input, each frame must be reshaped by patch embedding. Each frame $x \in \mathbb{R}^{H \times W \times C}$ is divided into patches of a fixed size P and reshaped to form a sequence of $N = HW/P^{2}$ patches. Each patch has dimension $P^{2} \cdot C$; after the embedding, it has dimension D, and the frame becomes a sequence $x_{p} \in \mathbb{R}^{N \times D}$.
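The patch embedding step can be sketched as below; the image size, patch size, and embedding dimension are placeholder values (the standard ViT defaults), not necessarily the settings used in this work.

```python
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split one frame into N = HW / P^2 patches and embed each patch to dimension D."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution splits the frame into P x P patches and applies
        # the linear embedding in a single step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, C, H, W), one keyframe per item
        x = self.proj(x)                         # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)      # (B, N, D) patch sequence
```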
After concatenating one learnable token to this sequence, a learnable positional embedding is added, and the sequence enters the transformer encoder. This token is called the CLS token. The encoder layer mechanism is defined as:

$$z'_{m} = \mathrm{MHA}(\mathrm{LN}(z_{m-1})) + z_{m-1}, \qquad z_{m} = \mathrm{FFN}(\mathrm{LN}(z'_{m})) + z'_{m},$$

where $m = 1, \ldots, M$; M is the number of feature extraction model layers. LN means layer normalization, and FFN is a feed-forward network that consists of a ReLU function and a fully connected layer. The output calculated in this way has shape $N \times D$, equal to the shape of the input.
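A pre-norm encoder layer matching this description could be sketched as follows; the hidden sizes are illustrative, and nn.MultiheadAttention stands in for the MHA defined earlier.

```python
import torch.nn as nn

class ViTEncoderLayer(nn.Module):
    """Pre-norm encoder layer: LN -> MHA -> residual, then LN -> FFN -> residual."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, z):                        # z: (B, N, D), CLS token included
        x = self.ln1(z)
        z = z + self.attn(x, x, x, need_weights=False)[0]   # z'_m
        return z + self.ffn(self.ln2(z))                     # z_m, same shape as the input
```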
Feature Extraction Gate. As shown in Figure 2, unlike other existing methods that use only the CLS token, we consider using the entire output sequence to make better features. First, we perform average pooling on the patch sequence to give it the same shape as the CLS token. After that, a weighted sum is performed, and the result passes through the sigmoid function to create the gate feature G, whose values lie between 0 and 1. This G determines which information to take from the CLS token. Likewise, $1 - G$ is used to control the avg-pooled patch feature. We then add the two features after an element-wise multiplication of G with the CLS token and of $1 - G$ with the pooled patch feature. In this way, a feature with the same shape as the CLS token is obtained that combines not only the CLS token information but also the patch feature information of one frame. Namely, this gate structure compares the features and produces a fusion of the CLS token and the patch sequence. It is formulated as:

$$G = \sigma\!\left(W_{g}\,[\,x_{\mathrm{CLS}}\,;\,\mathrm{AvgPool}(x_{\mathrm{patch}})\,]\right),$$

where $W_{g}$ represents the trainable weights and $[\,\cdot\,;\,\cdot\,]$ denotes concatenation. The CLS token $x_{\mathrm{CLS}}$ and the avg-pooled patch sequence $\mathrm{AvgPool}(x_{\mathrm{patch}})$ are concatenated and weighted-summed, and the resulting gate feature passes through the sigmoid function $\sigma$. The CLS token is weighted by G, and the pooled patch feature is weighted by $1 - G$. The formula is as follows:

$$F = G \odot x_{\mathrm{CLS}} + (1 - G) \odot \mathrm{AvgPool}(x_{\mathrm{patch}}),$$

where ⊙ means an element-wise multiplication. We named this module the feature extraction gate (FEG).
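A compact sketch of the FEG follows. Realizing the weighted sum over the concatenated features as a single linear layer (here w_g) is an assumption, and the feature dimension is a placeholder.

```python
import torch
import torch.nn as nn

class FeatureExtractionGate(nn.Module):
    """Sketch of the FEG: gate-controlled fusion of the CLS token and the pooled patches."""
    def __init__(self, d_model=768):
        super().__init__()
        self.w_g = nn.Linear(2 * d_model, d_model)   # trainable weights over the concatenation

    def forward(self, cls_tok, patch_seq):
        # cls_tok: (B, D); patch_seq: (B, N, D)
        pooled = patch_seq.mean(dim=1)                # avg-pool patches to the CLS token's shape
        gate = torch.sigmoid(self.w_g(torch.cat([cls_tok, pooled], dim=-1)))  # G in (0, 1)
        # Element-wise gated fusion: G keeps CLS information, (1 - G) keeps patch information.
        return gate * cls_tok + (1.0 - gate) * pooled
```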
3.2. Captioning Model
Captioning Encoder. Our captioning encoder is responsible for analyzing the temporal relations among the frame features extracted by the ViT. The input sequence length in the captioning part is T, the same as the number of keyframes. In this process, the relationships among keyframes, which reflect the temporal information of the video content, are learned by the captioning encoder. Because the captioning encoder has the same structure as the feature extraction encoder, it performs the same attention operations. However, we need an additional positional embedding to learn temporal features: since our feature extraction model is the ViT, which embeds patches of a single frame and applies attention to spatial information, it only performs spatial embedding. Therefore, we add a positional embedding to the output of the FEG so that the model can learn temporal relationships.
Next, we stack all encoder layer outputs. These stacked encoder layer outputs are as follows:

$$Z = \mathrm{Stack}\!\left(z^{(1)}, z^{(2)}, \ldots, z^{(L)}\right),$$

where $z^{(l)}$ is the output of the l-th captioning encoder layer, $l = 1, \ldots, L$; L is the number of layers. This stacked feature is used in the captioning decoder.
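The captioning encoder with its temporal positional embedding and stacked layer outputs could be sketched as below; the maximum number of keyframes and the way the layers are instantiated are illustrative choices.

```python
import torch
import torch.nn as nn

class CaptioningEncoder(nn.Module):
    """Sketch: add a temporal positional embedding and keep every layer's output."""
    def __init__(self, make_layer, num_layers, d_model, max_frames=30):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, max_frames, d_model))   # temporal embedding
        self.layers = nn.ModuleList(make_layer() for _ in range(num_layers))

    def forward(self, frame_feats):              # (B, T, D) features produced by the FEG
        z = frame_feats + self.pos[:, : frame_feats.size(1)]
        outputs = []
        for layer in self.layers:
            z = layer(z)
            outputs.append(z)                    # collect every encoder layer output
        return torch.stack(outputs, dim=1)       # (B, L, T, D), consumed by the decoder
```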
Captioning Decoder. The first sub-layer of the decoder applies masked multi-head self-attention to the embedded target words, where W is the maximum length of a sentence and therefore the length of the word sequence. Next, the second MHA in the decoder generates the channel attentive features. We named this MHA operation, given in Equation (10) and performed on the stacked encoder layer outputs, the channel self-attention (CSA). We construct the CSA with residual connections, so after the MHA is performed, the query vector is summed with the output vector. Additionally, the attention layer input is normalized by layer normalization before the attention operates, as mentioned in Equation (4).
Equation (12) is then performed to compute, for each channel attentive feature, the attentive features of the word representation obtained from the masked self-attention. When this cross multi-head attention is running, the word representation is attended against each channel attentive feature and made into a new vector; this time, the query vector is not summed. After this is finished, the per-channel results are stacked once more. Because values are accumulated as many times as there are encoder layers, multi-head channel attention is performed to obtain an attention value with the same size as the query vector. In the multi-head channel attention, if each encoder layer output is considered one channel, each channel's attention score is calculated, and the result reflects each channel in proportion to its score. Finally, the decoder output is obtained by layer normalization and passes through the FFN. In this way, the entire cross attention is performed.
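The decoder flow described above can be approximated with the sketch below. This is a simplified reading under several assumptions: the pre-attention layer normalizations are omitted for brevity, generic nn.MultiheadAttention modules stand in for the paper's CSA, cross attention, and channel attention, and the reshaping details are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class UATDecoderLayer(nn.Module):
    """Rough sketch of one decoder layer: masked word self-attention, channel
    self-attention (CSA) over the stacked encoder outputs, per-channel cross
    attention, and multi-head channel attention that fuses the channels."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.csa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.channel_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, words, enc_outputs, causal_mask):
        # words: (B, W, D) embedded target words; enc_outputs: (B, L, T, D).
        b, l, t, d = enc_outputs.shape
        s = words + self.self_attn(words, words, words, attn_mask=causal_mask)[0]

        # CSA on the stacked encoder outputs; the query (residual) is summed back.
        z = enc_outputs.reshape(b * l, t, d)
        z = z + self.csa(z, z, z)[0]

        # Cross attention of the word queries against each channel attentive feature;
        # here the query vector is *not* summed, and the L results are stacked.
        s_rep = s.unsqueeze(1).expand(b, l, -1, d).reshape(b * l, -1, d)
        c = self.cross_attn(s_rep, z, z)[0].reshape(b, l, -1, d)     # (B, L, W, D)

        # Multi-head channel attention: each encoder layer output is one channel; the
        # word query scores the L channels, and the result matches the query's size.
        w = s.size(1)
        c = c.permute(0, 2, 1, 3).reshape(b * w, l, d)               # (B*W, L, D)
        q = s.reshape(b * w, 1, d)                                   # (B*W, 1, D)
        fused = self.channel_attn(q, c, c)[0].reshape(b, w, d)       # (B, W, D)

        # Layer normalize and pass through the FFN to obtain the decoder output.
        y = self.ln(fused)
        return y + self.ffn(y)
```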
3.3. Universal Structure
We propose an encoder layer attention based on channel self-attention. However, the problem is that the encoder layers are independent, so each encoder layer output comes from a different layer. The self-attention mechanism creates a new value by comparing and scoring how strongly outputs from the same model are related. Therefore, it is pointless to perform self-attention over outputs that come from different, independent encoder layers. To overcome this problem, we adopt the universal transformer structure.
The universal transformer is a weight-shared structure: the parameters of every encoder layer (and every decoder layer) are shared, and each encoder layer output is passed to the next encoder layer, which has the same parameters. After L steps, where L is the number of layers, the encoder outputs are stacked and passed toward the decoder. The universal structure is defined as:

$$F_{l} = \mathrm{EncLayer}\!\left(F_{l-1}\right), \qquad l = 1, \ldots, L,$$

where EncLayer denotes the shared encoder layer and $F_{0}$ is the first input of the universal encoder, F. Because all outputs come from the same layer, the channel self-attention performs well, which helps to find useful features. Moreover, we construct the decoder as a universal network as well, for parameter balance with the encoder.
We define this method, which applies the CSA described in the captioning decoder section to the universal encoder layer outputs, as the universal encoder layer attention (UEA).
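In code, the universal structure amounts to reusing one set of layer parameters at every step while still collecting every step's output for the UEA. The sketch below illustrates this under an assumed, simplified interface, in contrast to the independently parameterized layers sketched for the captioning encoder above.

```python
import torch
import torch.nn as nn

class UniversalEncoder(nn.Module):
    """Weight-shared encoder: a single layer applied L times, keeping every step's output."""
    def __init__(self, shared_layer, num_steps):
        super().__init__()
        self.layer = shared_layer            # one set of parameters shared across all steps
        self.num_steps = num_steps           # L

    def forward(self, f):                    # f: (B, T, D), the first input F of the encoder
        outputs = []
        for _ in range(self.num_steps):
            f = self.layer(f)                # the same parameters are reused at every step
            outputs.append(f)
        return torch.stack(outputs, dim=1)   # (B, L, T, D), the inputs to the CSA / UEA
```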