Published in: Neural Processing Letters 2/2024

Open Access 01-04-2024

Rethinking Position Embedding Methods in the Transformer Architecture

Authors: Xin Zhou, Zhaohui Ren, Shihua Zhou, Zeyu Jiang, TianZhuang Yu, Hengfa Luo



Abstract

In the transformer architecture, as self-attention reads entire image patches at once, the context of the sequence between patches is omitted. Therefore, the position embedding method is employed to assist the self-attention layers in computing the ordering information of tokens. While many papers simply add the position vector to the corresponding token vector rather than concatenating them, few papers offer a thorough explanation and comparison beyond dimension reduction. However, the addition method is not meaningful because token vectors and position vectors are different physical quantities that cannot be directly combined through addition. Hence, we investigate the disparity in learnable absolute position information between the two embedding methods (concatenation and addition) and compare their performance on models. Experiments demonstrate that the concatenation method can learn more spatial information (such as horizontal, vertical, and angle) than the addition method. Furthermore, it reduces the attention distance in the final few layers. Moreover, the concatenation method exhibits greater robustness and leads to a performance gain of 0.1–0.5% for existing models without additional computation overhead.

1 Introduction

The Transformer architecture [1] has demonstrated remarkable success in various natural language processing (NLP) tasks [2–4]. It has also been extended to the field of computer vision, with the introduction of Vision Transformer (ViT) [5]. ViT divides an image into patches, which are then treated as semantic features and fed into a Transformer encoder for image classification. Position matrices are injected into these patches to preserve the ordering information. Building upon the achievements of ViT [5], several studies have adopted the Transformer architecture as a backbone for vision tasks such as object detection [6, 7], semantic segmentation [8], instance segmentation [9, 10], and image classification [5, 11]. The central component of these Transformer-based methods is self-attention [1], which allows for parallel processing of all input patches without recurrence or convolution. However, a limitation of self-attention is its inability to capture the sequential order of input data inherently. Thus, the inclusion of position embedding becomes crucial for Transformer models to incorporate the ordering information between tokens.
Currently, position encoding approaches in Transformer-based models can be broadly categorized into three classes: absolute position representation, relative position representation, and other methods like TUPE [12] and RoFormer [13]. In absolute position embedding methods [5, 11], the position vectors are directly added to the corresponding word vectors, with both vectors having the same length. The absolute position information can be encoded manually (e.g., using sine-cosine functions) or learned through self-learning techniques [1]. On the other hand, the relative position method [7] computes the positional relationship as a query to a look-up table that contains learnable parameters. This encoding strategy allows self-attention to consider arbitrary relationships between any two elements within a given range. However, the performance comparison between these two methods has been a subject of debate in some studies [5, 14]. An improved version of the relative position encoding was proposed in [15], suggesting that the relative method can replace the absolute method in specific image classification tasks. Although position embedding has been extensively explored, existing works have primarily used the addition method to inject position information into tokens rather than the concatenation method. However, word and position vectors are distinct physical quantities, and simply adding them together may not be appropriate from a dimensional analysis perspective [16]. Despite this, earlier works did not offer a detailed explanation for choosing the addition method over the concatenation method. After an extensive literature review, we found that only [17, 18] employed concatenation by injecting position embeddings to content features after projecting the outputs of patch embedding, thereby avoiding an excessive increase in channel dimensions. This concatenation strategy differs from the traditional addition strategy, where position embeddings are directly added to the outputs of patch embedding. Therefore, we are the first to replace the addition method with the concatenation method and analyze its impact on model performance. The main questions we aim to explore are: Is the position information learned by the two embedding methods similar? Are addition methods necessarily better for model performance, considering the same number of parameters?
Based on the questions above, this paper aims to investigate the differences in position information learned by the addition and concatenation methods and compare their performances on existing models. As shown in Fig. 1, the token vector’s embedding dimension is halved when using the concatenation method to concatenate the position vector. This ensures that the concatenated tokens retain the same input dimension as the added tokens when passed to the self-attention layer. The effectiveness of both injection approaches is tested on single- and multi-scale models to explore the spatial position information learned by the three positional items in Eqs. (7) and (8). Subsequently, these models are trained using the concatenation method on five downstream datasets with pretraining on the ImageNet-1k dataset to assess the transferability of the two methods. Finally, an ablation study is conducted by injecting the position embeddings into the outputs of patch embedding, query, key, and value separately. The contributions of this paper can be summarized as follows:
  • We find that the visualized position encodings accessed by the concatenation and addition methods are similar. However, the concatenation method reduces attention distance in the final few layers.
  • We observe that the concatenation method can learn horizontal, vertical, and angle information, whereas the addition method has limitations in this regard.
  • We discover that the concatenation method has greater robustness than the addition method, resulting in a gain of +0.1% to +0.5% for the existing models.
This paper investigates two absolute position embedding methods (concatenation and addition) in existing transformer-based models and finds that the concatenation method is more robust than the addition method. The remaining sections of the article are structured as follows: Sect. 2 introduces related work on vision transformers, self-attention, and absolute position embedding. Section 3 presents the methodology. Section 4 provides qualitative and quantitative experimental verification and analysis: image classification, transfer learning, and ablation studies. Section 5 offers a discussion of the results. Finally, Sect. 6 concludes this paper.

2 Related Work

2.1 Vision Transformers

The Vision Transformer (ViT) [5] is the pioneering model for image classification that utilizes a pure transformer architecture on large datasets like ImageNet-22k [19] and JFT-300M [20]. ViT has demonstrated excellent performance compared to state-of-the-art convolutional networks such as ResNets [21] and EfficientNet [22]. ViT [5] divides an image into non-overlapping, ordered patches, which are then flattened and embedded together with a class token and position information. These patches are mapped to a fixed dimension using a trainable linear projection and passed through multiple transformer encoder layers comprising a multi-head self-attention module (MHSA) and a feed-forward network (FFN). While ViT achieved impressive performance, it required large non-public datasets. To address this, DeiT [23] introduced a student–teacher regime and various training strategies to improve ViT’s performance. Subsequently, CaiT [24], ConViT [25], and MaxViT [26] adopted DeiT’s training strategies for their models. However, processing high-resolution feature maps in self-attention calculations is computationally intensive for Vision Transformers. To enhance computational efficiency, approaches such as feature pyramids [27] were introduced. PvT [9] employs feature maps at four scales for dense prediction tasks; each stage has a similar architecture consisting of patch embedding and vanilla transformer encoder layers, differing only in depth. Each scale also has an independent learnable position encoding. Similarly, LocalViT [28] introduced locality by incorporating depth-wise convolution in the feed-forward network to improve PvT’s performance. In contrast, Swin [7] and Cswin [29] adopted a window-based mechanism to address the computational challenges.

2.2 Self-attention

The self-attention function enables interactions between all elements in the input sequence, allowing them to assess the attention score received by each element. This mechanism involves a set of key-value pairs that explore the input sequence and produce the output sequence as attention-weighted sums of the values. Let us define the input sequence as \(x = (x_1,\ldots ,x_n)\) with each element \(x_i \in \mathbb {R}^{d_x}\), and the output sequence as \(z = (z_1,\ldots ,z_n)\) with each element \(z_i \in \mathbb {R}^{d_z}\). Each output element \(z_i\) is computed as a weighted sum of linearly transformed input elements:
$$\begin{aligned} z_i = \sum \limits _{j=1}^{n}\alpha _{ij}(x_{j}W_V), \end{aligned}$$
(1)
where each weight coefficient \(\alpha _{ij}\) is computed using a softmax function:
$$\begin{aligned} \alpha _{ij} = \frac{\exp \;{e_{ij}}}{\sum _{k=1}^{n}\exp \;{e_{ik}}}, \end{aligned}$$
(2)
where \(e_{ij}\) is computed using scaled dot-product attention [14], which helps alleviate the vanishing gradient problem [30]:
$$\begin{aligned} e_{ij} = \frac{(x_iW_Q)(x_jW_K)^T}{\sqrt{d_z}}, \end{aligned}$$
(3)
where the projections \( W_Q,W_K,W_V\in \mathbb {R}^{d_x \times d_z }\) are the learnable matrices for the queries, keys, and values. Each layer has its own set of three matrices, which usually remain approximately orthogonal to one another during computation.
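For concreteness, the computation in Eqs. (1)–(3) can be sketched in a few lines of PyTorch. This is a minimal single-head illustration under our own naming (d_x, d_z, and the sample sizes are illustrative), not the implementation of any particular model discussed later.

```python
import torch
import torch.nn as nn


class SingleHeadSelfAttention(nn.Module):
    """Minimal single-head self-attention following Eqs. (1)-(3)."""

    def __init__(self, d_x: int, d_z: int):
        super().__init__()
        # Learnable projections W_Q, W_K, W_V in R^{d_x x d_z}
        self.w_q = nn.Linear(d_x, d_z, bias=False)
        self.w_k = nn.Linear(d_x, d_z, bias=False)
        self.w_v = nn.Linear(d_x, d_z, bias=False)
        self.scale = d_z ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n, d_x)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        # e_ij = (x_i W_Q)(x_j W_K)^T / sqrt(d_z)   -- Eq. (3)
        e = q @ k.transpose(-2, -1) * self.scale
        # alpha_ij = softmax over j                 -- Eq. (2)
        alpha = e.softmax(dim=-1)
        # z_i = sum_j alpha_ij (x_j W_V)            -- Eq. (1)
        return alpha @ v


# Example: a sequence of n = 196 patch tokens with d_x = d_z = 384
tokens = torch.randn(2, 196, 384)
attn = SingleHeadSelfAttention(d_x=384, d_z=384)
out = attn(tokens)  # (2, 196, 384)
```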

2.3 Absolute Position Encoding

Transformer-based architectures depart from the use of recurrence and convolution in neural networks. Instead, they rely solely on the self-attention mechanism to establish global dependencies between inputs and outputs. However, this approach also leads to the loss of contextual ordering among input elements [1]. Position embeddings are introduced to address this, allowing each position vector to be injected into the corresponding input vector. This assists models in capturing the positional information of input sequences. There are two options for absolute position encodings: one hand-crafts sine and cosine functions with varying frequencies, while the other lets the model learn the position information on its own. Unlike the latter, the former contributes to the model’s generalization capability [14]. The following equations represent the sine and cosine functions with different frequencies [1]:
$$\begin{aligned} PE_{\left( p,2i\right) }= & {} \sin {\left( \frac{p}{10000^{2i/d_{model}}}\right) } \end{aligned}$$
(4)
$$\begin{aligned} PE_{\left( p,{2i+1}\right) }= & {} \cos {\left( \frac{p}{10000^{2i/d_{model}}}\right) }, \end{aligned}$$
(5)
where p is the token position and \(i\in [0,\frac{d_{model}}{2}]\), and \(d_{model}\) is the embedding dimension of input elements. This positional encoding prompts the model to easily learn to attend by relative position because \(PE_{p+k}\) can be represented as a linear function of \(PE_p\) for any fixed offset k [1].
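A minimal sketch of the fixed sine–cosine encoding of Eqs. (4) and (5), assuming PyTorch; the token count and embedding dimension below are illustrative.

```python
import torch


def sinusoidal_position_encoding(n_tokens: int, d_model: int) -> torch.Tensor:
    """Fixed sine-cosine encodings of Eqs. (4)-(5), one row per position p."""
    p = torch.arange(n_tokens, dtype=torch.float32).unsqueeze(1)   # (n, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)       # even dimensions 2i
    div = torch.pow(10000.0, two_i / d_model)                      # 10000^{2i/d_model}
    pe = torch.zeros(n_tokens, d_model)
    pe[:, 0::2] = torch.sin(p / div)   # PE_(p, 2i)
    pe[:, 1::2] = torch.cos(p / div)   # PE_(p, 2i+1)
    return pe


pe = sinusoidal_position_encoding(n_tokens=196, d_model=384)  # (196, 384)
```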
However, a study conducted by [4] reveals that the distance-awareness property diminishes when the position encodings \(PE_p\) and \(PE_{p+k}\) are projected into the learnable high-dimensional space of self-attention. Despite this finding, most existing works [5–8] utilize the addition method to incorporate position encodings \( p = (p_1,\ldots ,p_n)\) into the input token x, as follows:
$$\begin{aligned} x_i = x_i + p_i, \end{aligned}$$
(6)
where the positional encodings p have the same length as the input token embeddings x and \(p_i,x_i \in \mathbb {R}^{d_x}\).

3 Methodology

3.1 Theoretical Analysis and Assumption

The position embedding enables the original self-attention mechanism to capture the ordering relations of input tokens effectively. Here we rewrite Eq. (3) using the addition method:
$$\begin{aligned} e_{ij}= & {} \frac{((x_i + p_i)W_Q)((x_j + p_j)W_K)^T}{\sqrt{d_z}}\nonumber \\= & {} \frac{x_iW_{QK}x_j^T+p_iW_{QK}x_j^T+x_iW_{QK}p_j^T+p_iW_{QK}p_j^T}{\sqrt{d_z}}, \end{aligned}$$
(7)
In contrast to the initial self-attention item \({x_iW_{QK}x_j^T}\) in the numerator of Eq. (7), the extra three items also focus on positional information in images. The four items indicate how much attention the query token should pay to other tokens in the learnable high-dimensional space (QK). Importantly, we notice that these four items are computed simultaneously and do not trade off against one another because they share the QK matrix. In fact, there is no evidence of a relationship between positions and tokens, so an independent QK space should benefit model performance. Therefore, we utilize the concatenation method to access four independent weight spaces, in which the content-to-content, position-to-content, content-to-position, and position-to-position terms can be updated independently. To maintain a parameter count similar to the addition method, we redefine the input to be \(x^{\prime } = (x_1^{\prime },\ldots ,x_n^{\prime })\) of n elements where \(x_i^{\prime }\in \mathbb {R}^{d_x/2}\). The output is denoted as \(z^{\prime } = (z_1^{\prime },\ldots ,z_n^{\prime })\) of n elements, where \(z_i^{\prime } \in \mathbb {R}^{d_z/2}\), and the positions are represented as \(p^{\prime } = (p_1^{\prime },\ldots ,p_n^{\prime })\) of n elements where \(p_i^{\prime }\in \mathbb {R}^{d_x/2}\). Based on these definitions, we reformulate Eq. (3) using the concatenation method:
$$\begin{aligned} e_{ij}= & {} \frac{((x_i^{\prime }\oplus p_i^{\prime })(W_{q_1}\oplus W_{q_2}))((x_j^{\prime }\oplus p_j^{\prime })(W_{k_1}\oplus W_{k_2}))^T}{\sqrt{d_z}}\nonumber \\= & {} \frac{x_i^{\prime }W_{(qk)_1}{x_j^{\prime }}^T+p_i^{\prime }W_{(qk)_2}{x_j^{\prime }}^T+x_i^{\prime }W_{(qk)_3}{p_j^{\prime }}^T+p_i^{\prime }W_{(qk)_4}{p_j^{\prime }}^T}{\sqrt{d_z}}, \end{aligned}$$
(8)
where the projections \(W_{q_1},W_{q_2 },W_{k_1},W_{k_2}\in \mathbb {R}^{d_x\times (d_z/2)}\) are the learnable matrices for the queries and keys under the concatenation method, and the sign \(\oplus \) denotes matrix concatenation. Expanding the first line of Eq. (8), we obtain the independent sub-attention weight matrices \(W_{(qk)_1},W_{(qk)_2},W_{(qk)_3},W_{(qk)_4}\in \mathbb {R}^{d_x\times (d_z/2)}\). If the four sub-attention matrices satisfied \(W_{(qk)_1}=W_{(qk)_2}=W_{(qk)_3}=W_{(qk)_4}\), the concatenation method would degenerate into the addition method, where the attention weight matrix is shared. However, it is challenging to train the concatenation method to be equivalent to the addition method when the four sub-weight matrices are initially orthogonal in high dimensions [31]. In existing vision tasks implemented in Python, the initial weights of linear transformations are typically defined using learnable Gaussian random vectors. Additionally, a sub-Gaussian vector is a random vector with independent and isotropic components [31]. As a result, it is difficult to maintain equivalence between the four sub-QK matrices in the concatenation method.
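A minimal sketch of the two injection strategies compared in Eqs. (7) and (8), assuming PyTorch: the concatenation variant halves the token dimension so that both variants feed the self-attention layer with the same width. All tensor sizes and names are illustrative, not the exact implementation of the models evaluated in Sect. 4.

```python
import torch
import torch.nn as nn

n_tokens, d = 196, 384                        # patch tokens and model width (illustrative)
x_add = torch.randn(1, n_tokens, d)           # content tokens for the addition method
x_cat = torch.randn(1, n_tokens, d // 2)      # content tokens with halved dimension

# Learnable absolute position embeddings
pos_add = nn.Parameter(torch.zeros(1, n_tokens, d))        # same width as the tokens
pos_cat = nn.Parameter(torch.zeros(1, n_tokens, d // 2))   # half width

# Addition: x_i + p_i -> the four terms of Eq. (7) share a single W_QK space
tokens_add = x_add + pos_add                      # (1, n, d)

# Concatenation: [x_i' ; p_i'] -> Eq. (8), the projections split into sub-blocks
tokens_cat = torch.cat([x_cat, pos_cat], dim=-1)  # (1, n, d)

# Both variants present the same input width d to the self-attention layer,
# so parameters and FLOPs stay roughly matched (cf. Table 1).
w_q = nn.Linear(d, d, bias=False)
q_add, q_cat = w_q(tokens_add), w_q(tokens_cat)
```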

3.2 Efficient Implementation

When we study the computation of self-attention, it is evident that existing models differ in depth. Residual structures [32] allow position matrices to be propagated consistently to the next layer. Consequently, the input elements of each depth layer change continuously during the computation of the attention weights. To overcome the limitations of single-scale models, we also explore several multi-scale models in which position encodings are added at each scale. In single-scale models like ViT [5], CaiT [24], and DeiT [23], which are primarily used for image classification, self-attention computations are performed at the same scale throughout the network. As a result, all input patches per layer share the same position matrix p, and the initial input per layer can be expressed as follows:
$$\begin{aligned} x_{d+1}= & {} x_0 + Atten(x_d) + Mlp(x_d)\nonumber \\ p_{d+1}= & {} p, \end{aligned}$$
(9)
where \(Atten(\cdot )\) and \(Mlp(\cdot )\) denote the outputs of the last self-attention and MLP layers, respectively, and d is the depth number.
Multi-scale models extend the vanilla ViT model to accommodate feature maps of different sizes. As a result, these models can be applied to a wider range of downstream tasks, such as object detection and semantic segmentation, rather than being limited to image classification [9]. When studying the impact of position embeddings on multi-scale models, we investigate models that utilize position matrices at different scales. Each stage requires the re-embedding of a new-scale position matrix. Therefore, Eq. (9) is reformulated as follows:
$$\begin{aligned} x_{d+1}= & {} x_0 + Atten(x_d^s) + Mlp(x_d^s)\nonumber \\ p_{d+1}= & {} p^s, \end{aligned}$$
(10)
where \(s \in [0,4)\) is the stage number. We compute the relationship between word and position vectors at each depth by substituting Eqs. (9) and (10) into Eqs. (7) and (8), respectively.
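The practical difference between Eqs. (9) and (10) amounts to how the learnable position matrices are declared. A hedged sketch follows; the token counts and dimensions are illustrative, loosely following DeiT-style (single-scale) and PvT-style (multi-scale) configurations rather than the exact model code.

```python
import torch
import torch.nn as nn

# Single-scale (DeiT-style): one position matrix p shared by every depth layer
n, d, depth = 196, 384, 12
p = nn.Parameter(torch.zeros(1, n, d))        # p_{d+1} = p for all depths, Eq. (9)
x = torch.randn(1, n, d) + p                  # injected once, carried through residuals

# Multi-scale (PvT-style): each stage s re-embeds its own position matrix p^s, Eq. (10)
stage_tokens = [56 * 56, 28 * 28, 14 * 14, 7 * 7]   # token counts per stage (illustrative)
stage_dims = [64, 128, 320, 512]                    # embedding dims per stage (illustrative)
p_stages = [nn.Parameter(torch.zeros(1, n_s, d_s))
            for n_s, d_s in zip(stage_tokens, stage_dims)]
```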

4 Experiment

In this section, we begin by conducting experiments to illustrate the differences between two embedding methods across various models [23–25, 28] using the ImageNet-1K dataset [19]. Subsequently, we perform transfer learning on downstream datasets, including Cifar10 [33], Cifar100 [34], Oxford-flowers102 [35], Pets [36], and Stanford Cars [37], using pre-trained models on ImageNet-1K. This is done to examine the transferability of the two embedding methods. Finally, we conduct ablation studies to assess the impact of injecting the position embeddings at different positions on the performance of the models.
Table 1
Comparison of two embedding methods on different models on the ImageNet-1k dataset [19]

| Models | Param. (M)\(\downarrow \) | FLOPs (G)\(\downarrow \) | Mem. (G)\(\downarrow \) | Speed (im/s)\(\uparrow \) | Top-1 (%) (Add./Concat.)\(\uparrow \) |
| --- | --- | --- | --- | --- | --- |
| DeiT-S [23] | 22.0/21.6 | 4.23/4.22 | 6.77/6.96 | 986.37/916.29 | 79.8/80.0 (+0.2) |
| CaiT-XXS36 [24] | 17.2/17.3 | 3.24/3.22 | 14.12/14.13 | 522.84/543.09 | 79.1/79.3 (+0.2) |
| ConViT-S [25] | 27.7/27.5 | 5.36/5.32 | 12.26/12.25 | 600.84/620.23 | 81.3/81.6 (+0.3) |
| VoLo-D1 [38] | 26.6/26.7 | 6.52/6.53 | 14.01/14.03 | 511.35/526.45 | 84.2/84.2 |
| TnT-S [39] | 23.8/23.6 | 4.85/4.83 | 13.47/13.45 | 466.92/467.39 | 81.5/81.4 (\(-\)0.1) |
| PvT-S [9] | 24.5/23.9 | 3.69/3.64 | 10.50/9.99 | 854.09/864.26 | 79.8/80.3 (+0.5) |
| LocalViT-PvT [28] | 13.5/12.6 | 1.96/1.91 | 17.89/17.52 | 768.09/773.06 | 78.2/78.5 (+0.3) |
| DeiT-B [23] | 86.6/85.8 | 16.86/16.81 | 14.30/14.29 | 310.78/311.10 | 81.8/82.2 (+0.4) |
| CaiT-XS36 [24] | 38.5/38.4 | 7.25/7.23 | 22.16/21.36 | 330.80/344.26 | 82.6/82.6 |
| ConViT-B [25] | 86.4/86.1 | 16.81/16.75 | 17.51/17.50 | 256.20/262.41 | 82.2/82.2 |
| VoLo-D2 [38] | 58.6/58.7 | 13.61/13.63 | 23.98/23.98 | 266.24/273.19 | 85.2/85.3 (+0.1) |
| TnT-B [39] | 65.4/65.1 | 13.44/13.40 | 21.68/21.65 | 264.99/265.12 | 82.9/82.8 (\(-\)0.1) |
| PvT-M [9] | 44.2/43.6 | 6.47/6.41 | 14.54/14.26 | 542.86/546.84 | 81.2/81.5 (+0.3) |

All models are trained on the default input size of \(224 \times 224\). The parameters, FLOPs, and speed are evaluated on a V100 with a batch size of 256. The speed represents the throughput during inference. Mem. denotes the maximum GPU memory allocated since the beginning of the program in PyTorch, with a batch size of 128

4.1 Image Classification

Training: We train both single- and multi-scale models using different absolute position embedding methods on ImageNet-1k, which consists of 1000 classes for image classification. The dataset comprises 1.28 million training images and 50k validation images. We adhere to the same training strategies as DeiT [23] for all experiments to ensure a fair comparison. This strategy employs the AdamW [40] optimizer with a weight decay of 0.05 and a momentum of 0.9. The initial learning rate is set to \(1\times 10^{-3}\) with five epochs of warmup, followed by a decrease according to the cosine schedule [41] until reaching a minimum learning rate of \(1\times 10^{-5}\). We adopt the same data augmentation and regularization methods as DeiT [23]. Furthermore, the models with the concatenation method are trained from scratch for 300 epochs, with a batch size of 1024 on 8 V100 GPUs. Random crops are applied during training, while center crops are used during validation to resize input images to \(224 \times 224\).
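For reference, the optimizer and schedule described above can be sketched with standard PyTorch components. The model placeholder and warmup factor below are our own assumptions; the actual experiments follow the full DeiT [23] recipe rather than this simplified loop.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(384, 1000)   # placeholder for a DeiT-style classifier

epochs, warmup_epochs = 300, 5
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# 5 warmup epochs, then cosine decay down to the 1e-5 minimum learning rate
warmup = LinearLR(optimizer, start_factor=1e-2, total_iters=warmup_epochs)
cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs, eta_min=1e-5)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[warmup_epochs])

for epoch in range(epochs):
    # ... one training epoch over ImageNet-1k batches ...
    scheduler.step()
```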
Results: Table 1 illustrates that the parameters and FLOPs of models using the concatenation and addition methods remain similar. This can be attributed to the reduction of the embedding dimension by half before concatenating the position encodings with the content features. However, the two embedding methods can impact the maximum GPU memory during model training. For example, using the concatenation method increases the memory usage of DeiT-S [23] and CaiT-XS36 [24] by 0.19G and 0.2G, respectively. On the other hand, PvT-S [9], PvT-M, and LocalViT-PvT experience a decrease in memory usage of 0.51G, 0.28G, and 0.37G, respectively, when using the concatenation method. In terms of throughput, except for DeiT-S, the concatenation method improves the performance of the listed models. For instance, Cait-XXS36 achieves a throughput increase from 522.84 to 543.09 im/s when using the concatenation method despite requiring an additional 0.1M parameters compared to the addition method. Similarly, ConViT-S [25] shows an improvement of 19.39 im/s with the concatenation method. We reduce the embedding dimension at the patch embedding stage to ensure a similar number of parameters. Lower embedding dimensions theoretically hurt model performance [7, 9, 23]. However, our experiments demonstrate that the concatenation method positively affects the top-1 accuracy of many models, possibly because these models are still sparse. In contrast to the addition method, the concatenation method leads to a gain of +0.4% and +0.3% for DeiT-B and CaiT-XS46, respectively. It also improves the performance by +0.5%, +0.3%, and +0.3% for PvT-S, PvT-M, and LocalViT-PvT [28], respectively. For Volo-D1 [38], CaiT-XS36, and ConViT-B, the two embedding methods yield the same model performance. Furthermore, we observed that the concatenation method decreases the top-1 accuracy of the TnT series [39] by \(-\) 0.1%. This is because the reduced dimensions have a negative impact on the pixel embedding component in the TnT model. While traditional transformer models only have a patch embedding process before entering the transformer block, the TnT series involves an interaction between pixel and patch embedding. Therefore, the influence of the addition and concatenation methods on model performance depends more on the network structure.
Moreover, we conducted a comparison of the average attention distance between models using the two embedding methods, as shown in Fig. 2. We observed that the concatenation method can reduce the attention distance in the last few layers. In particular, the attention distance of DeiT-S significantly decreases, as indicated by the light blue rectangle in Fig. 2a, when the concatenation method is employed. This reduction in attention distance is beneficial for capturing more local information [42]. It also suggests that the concatenation method positively influences information extraction. Additionally, Fig. 3 illustrates that both embedding methods can accurately attend to recognized features in images. However, we also noticed that the concatenation method can attend to multiple identifiable features, as depicted by the pink marked rectangle, compared to the addition method. In summary, the concatenation method can be utilized for model training and effectively improves the throughput of models compared to the addition method while also exhibiting greater robustness.
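One common way to compute the average attention distance shown in Fig. 2 is to weight the spatial distance between patch centres by the attention scores and average over queries and heads. The sketch below (ignoring the class token) reflects our assumption of that procedure rather than the exact script used to produce the figure.

```python
import torch


def mean_attention_distance(attn: torch.Tensor, grid: int, patch: int = 16) -> torch.Tensor:
    """attn: (heads, n, n) attention weights over an n = grid*grid patch grid.
    Returns the attention-weighted mean spatial distance (in pixels) per head."""
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float() * patch  # (n, 2)
    # Pairwise Euclidean distances between patch centres: (n, n)
    dist = torch.cdist(coords, coords)
    # Weight each query->key distance by its attention score, average over queries
    return (attn * dist).sum(dim=-1).mean(dim=-1)   # (heads,)


attn = torch.rand(6, 196, 196).softmax(dim=-1)      # dummy attention for a 14x14 grid
print(mean_attention_distance(attn, grid=14))
```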

4.2 Transfer Learning

As mentioned earlier, the concatenation method has shown a slight improvement in the performance of several models compared to the addition method. However, it is important to evaluate the transferability of these two methods. To address this, we conducted experiments using the listed models pre-trained on ImageNet-1k [19] to train five downstream classification tasks for 300 epochs. These datasets include Cifar10 [33], Cifar100 [34], Flowers102 [35], Cars [37], and Pets [36], as shown in Table 2. All training was performed on a V100 GPU with an input size of \(224\times 224\). Based on the results presented in Table 3, we found that the transferability of models using the concatenation method is similar to that of models using the addition method. Specifically, the models CaiT-XXS36 [24] and ConViT-S [25], when using the concatenation method, achieved improvements on the five downstream datasets compared to the addition method. Furthermore, DeiT-S [23] showed a performance gain of +0.3% with the concatenation method compared to the addition method. Similarly, PvT-S [9] exhibited a gain of +0.3% with the concatenation method compared to the addition method. Additionally, the performance of VoLo-D1 [38] on these datasets was relatively similar when using both embedding approaches. In conclusion, the models using the concatenation method demonstrate robustness in transfer learning tasks, similar to the addition method.
Table 2
Datasets used in downstream image classification tasks with ImageNet-1k pre-training

| Datasets | Domain | Input size | Train size | Test size | Classes |
| --- | --- | --- | --- | --- | --- |
| ImageNet-1k [19] | Mixed | Various | 1,281,167 | 50,000 | 1000 |
| Cifar100 [34] | Mixed | \(32\times 32\) | 50,000 | 10,000 | 100 |
| Cifar10 [33] | Mixed | \(32\times 32\) | 50,000 | 10,000 | 10 |
| Flowers102 [35] | Flowers | Various | 2040 | 6149 | 102 |
| Cars [37] | Cars | Various | 8144 | 8041 | 196 |
| Pets [36] | Dogs and cats | Various | 3698 | 3695 | 37 |

The top three datasets are used for generic classification tasks, while the bottom three datasets are specifically utilized for fine-grained tasks
Table 3
Comparison of Transformer-based models with two embedding methods on downstream classification tasks using ImageNet-1k pre-training

| Models | Methods | Param. (M) | ImageNet-1k | Cifar10 | Cifar100 | Flowers102 | Cars | Pets |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeiT-S [23] | Add | 22.0 | 79.8 | 98.6 | 87.5 | 99.2 | 92.0 | 95.5 |
|  | Concat | 21.6 | 80.0 | 98.6 | 87.8 | 99.1 | 92.1 | 95.7 |
| CaiT-XXS36 [24] | Add | 17.2 | 79.1 | 98.8 | 88.8 | 99.1 | 92.3 | 95.8 |
|  | Concat | 17.3 | 79.3 | 98.9 | 89.0 | 99.3 | 92.5 | 96.0 |
| ConViT-S [25] | Add | 27.7 | 81.3 | 98.5 | 89.1 | 99.2 | 92.2 | 95.8 |
|  | Concat | 27.5 | 81.6 | 98.6 | 89.3 | 99.3 | 92.5 | 96.1 |
| PvT-S [9] | Add | 24.5 | 79.8 | 98.3 | 81.7 | 99.1 | 91.5 | 95.8 |
|  | Concat | 23.9 | 80.3 | 98.4 | 82.0 | 98.9 | 91.8 | 96.0 |
| TnT-S [39] | Add | 23.8 | 81.5 | 98.6 | 88.6 | 99.2 | 92.8 | 95.9 |
|  | Concat | 23.6 | 81.4 | 98.6 | 88.6 | 99.1 | 92.6 | 95.8 |
| VoLo-D1 [38] | Add | 26.6 | 84.2 | 98.9 | 89.8 | 99.4 | 93.4 | 96.7 |
|  | Concat | 26.7 | 84.2 | 98.9 | 89.9 | 99.4 | 93.5 | 96.7 |

All models with an input size of \(224 \times 224\) were trained uniformly on a V100 GPU

4.3 Ablation Studies

To further investigate the differences between the two embedding methods, we performed an ablation study on the embedding position in the vanilla vision transformer with a depth of 12. The position encodings were injected into different locations, including the outputs of patch embedding (common), query, key, and value, using both the concatenation and addition methods. These models were trained for 300 epochs on the Flowers102 dataset [35] from scratch.
As depicted in Fig. 4, the heat maps generated by the concatenation method are noticeably better than those generated by the addition method for input image 1. Specifically, when we replaced the addition with the concatenation at the common position for the input images, the attention area of the salient object became larger. The attention matrices clearly demonstrate the significant differences in weights. This finding also explains why the concatenation method slightly improves the performance of several models, as shown in Table 1. Furthermore, the heat maps differ when we inject the position information into different tokens in the self-attention layer. The attention matrices of value+position outperform those of query+position and key+position for input image 1 and image 2, resulting in better heat maps. Additionally, we compared the similarity of attention matrices between the two embedding methods, as shown in Fig. 5. We observed that the similarity curve varies when the position encodings are plugged into the query and key tokens. However, the curve is approximately similar when we incorporate the position information into the outputs of patch embedding and value tokens using both methods. In conclusion, the concatenation method generates larger attention areas on objects than the addition method. The attention matrices are similar when we concatenate/add the position information to the outputs of patch embedding.
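The ablation variants can be sketched as follows, assuming PyTorch; the tensors and projection names are illustrative, and only the addition form is written out, the concatenation form being analogous with halved dimensions.

```python
import torch
import torch.nn as nn

n, d = 196, 384
x = torch.randn(1, n, d)                      # patch-embedding output
pos = nn.Parameter(torch.zeros(1, n, d))      # learnable absolute position embedding
w_q, w_k, w_v = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)

# "common": inject into the patch-embedding output before the q/k/v projections
q, k, v = w_q(x + pos), w_k(x + pos), w_v(x + pos)

# "value": keep q and k content-only and inject the position only into the value
q, k, v = w_q(x), w_k(x), w_v(x + pos)

# The concatenation variants are analogous: halve the corresponding dimension and
# use torch.cat([content, position], dim=-1) instead of the addition above.
```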

5 Discussion

5.1 Analysis of Learnable Position Encodings

As depicted in Fig. 6, the position encodings learned by the single-scale models using the two embedding methods exhibit remarkable similarity. The similarity values for the DeiT-S [23] and CaiT-XXS36 [24] models are 0.93 and 0.96, respectively. Additionally, the position information obtained through concatenation appears to be more centralized compared to the addition method in single-scale models. When utilizing the concatenation method, the color area resembling the highlighted spot (queried position) is smaller, as indicated by the sky-blue marked area in the top two rows of Fig. 6. In the case of multi-scale models, the similarity values of the learned position information are relatively lower, reaching 0.89 and 0.82 for the LocalViT-PvT [28] and PvT-S [9] models, as shown in the bottom curve graph. The concatenation curve appears smoother than the addition curve. Analyzing the visualized maps on the left, we observe that the learned attention positions using the addition method are scattered across the entire map. In contrast, those using the concatenation method converge towards the highlighted spot. Furthermore, the sine-cosine hand-crafted position information fails to present a distinct difference between the queried position and other positions in the four curve graphs. Ideally, we would like to see noticeable differences between each queried token (highlighted spot) and other tokens to achieve better model performance. In summary, the concatenation method can learn better position relations than the addition method.
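Similarity maps of this kind are typically produced by comparing the position embedding of one queried patch with those of all other patches; a short sketch under that assumption follows (the grid size and tensor shapes are illustrative, not the exact procedure behind Fig. 6).

```python
import torch
import torch.nn.functional as F

pos = torch.randn(196, 384)       # learned position embeddings for a 14x14 patch grid
query_idx = 6 * 14 + 7            # an arbitrary queried patch position

# Cosine similarity between the queried position vector and every other position,
# reshaped back to the 14x14 grid for visualisation
sim = F.cosine_similarity(pos[query_idx].unsqueeze(0), pos, dim=-1).reshape(14, 14)
```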

5.2 Spatial Information Representation

In Fig. 7, we observe that the concatenation method enhances the learning of three positional items in the Eq. (8): \(p_i^{\prime }W_{(qk)_2}x{^{\prime }_j}^T,x_i^{\prime }W_{(qk)_3}p{^{\prime }_j}^T,p_i^{\prime }W_{(qk)_4}p{^{\prime }_j}^T\), enabling them to capture horizontal, vertical, and angular position information. On the other hand, the addition method only learns the angle information from these three items. However, Fig. 7d and h demonstrate that the angle position still plays a crucial role in the final position determination within the self-attention layer, regardless of whether the addition or concatenation method is used. In Fig. 8, we also observe that models using the concatenation method can learn detailed spatial information, including horizontal, vertical, and angular positions. In contrast, in the case of the single-scale model, the addition method tends to blur the horizontal and vertical positions, relying more on the vertical information for position determination. Consequently, we conclude that transformer models can learn concrete spatial positions by utilizing the three positional items within the self-attention layer, particularly after concatenating the absolute position encodings. Furthermore, the addition and concatenation methods can learn the same type of position decision within a single model. As depicted in Fig. 9b, the position information learned by each token exhibits apparent periodic fluctuations for the addition method. Additionally, the scale range of the learned position attention is between \(-300\) and 400, which is much broader than the scale range observed with the concatenation method. The reason for this difference is that the concatenation method can learn multiple types of spatial information, while the addition method relies on only one type. Hence, the addition method requires a broader scale to distinguish the unique information it learns. In the case of the multi-scale model, the range difference of this scale between Fig. 9c and d is significantly reduced because the addition method learns two types of position information.

6 Conclusion

This paper investigates the differences between the concatenation and addition methods for learnable absolute position encodings in existing transformer-based models. Through experiments, we discover that the concatenation method enables simultaneous attention to horizontal, vertical, and angular positions, while the addition method only focuses on a specific type of position. Additionally, the concatenation method attains a larger attention area toward objects in images and reduces the attention distance in the final layers of the model. These advantages contribute to a performance gain of 0.1–0.5% for the models, while achieving performance on transfer learning tasks comparable to the addition method. Consequently, further research on the effect of the concatenation method on model performance is warranted.

Acknowledgements

The project is supported by the Natural Science Foundation of China (No. 52275091), Natural Science Foundation of Liaoning Province (No. 2022-MS-125), and Fundamental Research Funds for the Central Universities (No. N2303011)

Declarations

Conflict of interest

No conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix

Position Embeddings of Models

In Fig. 10, we illustrate the learning of spatial position representation using the addition and concatenation methods for patch-level inputs (\(14\times 14\)). In the single-scale model, both embedding methods demonstrate similar capabilities in learning position relations, although the concatenation method narrows down the highlighted area. In the multi-scale model, the advantage of the concatenation method becomes more prominent as the position information converges towards the highlighted spot. It is important to note that the highlighted spot represents the position of a queried token. Therefore, the narrowing of the highlighted area allows for maximizing the distinction between the queried position and other positions.

Convergence of Each Layer for Different-Scale Models

As depicted in Fig. 11, we investigated the vector products of the four items in the numerator of Eqs. (7) and (8) for each layer after testing 500 images with a pre-trained model. The results obtained from the two embedding methods demonstrate that the three positional items (\(p_iW_{QK}x_j^T,x_iW_{QK}p_j^T,p_iW_{QK}p_j^T\)) gradually converge from the first to the last layer and fluctuate around zero at the final layer. This phenomenon is also observed in the multi-scale model. Consequently, the experiment provides evidence that the adding and concatenating methods play similar roles in the system.
Literature
1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems (NeurIPS), vol 20
2. Devlin J, Chang MW, Lee K, Toutanova K (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp 4171–4186
3. Dai Z, Yang Z, Yang Y, Carbonell J, Le QV, Salakhutdinov R (2019) Transformer-xl: attentive language models beyond a fixed-length context. In: ACL, vol 1
5. Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, Houlsby N (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929
6. Carion N, Massa F, Synnaeve G, Usunier N, Kirillov A, Zagoruyko S (2020) End-to-end object detection with transformers. In: European conference on computer vision (ECCV), pp 213–229
7. Liu Z, Lin Y, Cao Y, Hu H, Wei Y, Zhang Z, Guo B (2021) Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 10012–10022
8. Zheng S, Lu J, Zhao H, Zhu X, Luo Z, Wang Y, Zhang L (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 6881–6890
9. Wang W, Xie E, Li X, Fan DP, Song K, Liang D, Shao L (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 568–578
10. Chu X, Tian Z, Wang Y, Zhang B, Ren H, Wei X, Shen C (2021) Twins: revisiting the design of spatial attention in vision transformers. In: Advances in neural information processing systems (NeurIPS), vol 34, pp 9355–9366
11. Wu H, Xiao B, Codella N, Liu M, Dai X, Yuan L, Zhang L (2021) Cvt: introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 22–31
15. Wu K, Peng H, Chen M, Fu J, Chao H (2021) Rethinking and improving relative position encoding for vision transformer. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV), pp 10033–10041
16. Bowers BJ, Schatzman L (2021) Dimensional analysis. In: Developing grounded theory. Routledge, New York, pp 111–129
17. Meng D, Chen X, Fan Z, Zeng G, Li H, Yuan Y, Sun L, Wang J (2021) Conditional detr for fast training convergence. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 3651–3660
18. Shi S, Jiang L, Dai D, Schiele B (2022) Motion transformer with global intention localization and local movement refinement. Adv Neural Inf Process Syst 35:6531–6543
19. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. IEEE Comput Vis Pattern Recognit (CVPR) 248–255
20. Sun C, Shrivastava A, Singh S, Gupta A (2017) Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE international conference on computer vision, pp 843–852
21. Kolesnikov A, Beyer L, Zhai X, Puigcerver J, Yung J, Gelly S, Houlsby N (2020) Big transfer (bit): general visual representation learning. In: European conference on computer vision (ECCV), pp 491–507
22. Xie Q, Luong MT, Hovy E, Le QV (2020) Self-training with noisy student improves imagenet classification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 10687–10698
23. Touvron H, Cord M, Douze M, Massa F, Sablayrolles A, Jégou H (2021) Training data-efficient image transformers and distillation through attention. In: International conference on machine learning (PMLR), pp 10347–10537
24. Touvron H, Cord M, Sablayrolles A, Synnaeve G, Jégou H (2021) Going deeper with image transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 32–42
25. d’Ascoli S, Touvron H, Leavitt ML, Morcos AS, Biroli G, Sagun L (2021) Convit: improving vision transformers with soft convolutional inductive biases. In: International conference on machine learning, pp 2286–2296
26. Tu Z, Talebi H, Zhang H, Yang F, Milanfar P, Bovik A, Li Y (2022) Maxvit: multi-axis vision transformer. In: European conference on computer vision, pp 459–479
27. Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S (2017) Feature pyramid networks for object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 2117–2125
28.
29. Dong X, Bao J, Chen D, Zhang W, Yu N, Yuan L, Chen D, Guo B (2022) Cswin transformer: a general vision transformer backbone with cross-shaped windows. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 12124–12134
31. Vershynin R (2018) High-dimensional probability: an introduction with applications in data science. Cambridge University Press, Cambridge
32. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778
33. Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 84–90
35. Nilsback ME, Zisserman A (2008) Automated flower classification over a large number of classes. In: 2008 Sixth Indian conference on computer vision, graphics and image processing, pp 722–729
36. Parkhi OM, Vedaldi A, Zisserman A, Jawahar CV (2012) Cats and dogs. In: 2012 IEEE conference on computer vision and pattern recognition, pp 3498–3505
37. Krause J, Stark M, Deng J, Fei-Fei L (2013) 3d object representations for fine-grained categorization. In: Proceedings of the IEEE international conference on computer vision workshops, pp 554–561
38. Yuan L, Hou Q, Jiang Z, Feng J, Yan S (2022) Volo: vision outlooker for visual recognition. IEEE Trans Pattern Anal Mach Intell 45:6575–6586
39. Han K, Xiao A, Wu E, Guo J, Xu C, Wang Y (2021) Transformer in transformer. Adv Neural Inf Process Syst 34:15908–15919
42. Khan S, Naseer M, Hayat M, Zamir SW, Khan FS, Shah M (2022) Transformers in vision: a survey. ACM Comput Surv (CSUR) 1–41
Metadata
Title
Rethinking Position Embedding Methods in the Transformer Architecture
Authors
Xin Zhou
Zhaohui Ren
Shihua Zhou
Zeyu Jiang
TianZhuang Yu
Hengfa Luo
Publication date
01-04-2024
Publisher
Springer US
Published in
Neural Processing Letters / Issue 2/2024
Print ISSN: 1370-4621
Electronic ISSN: 1573-773X
DOI
https://doi.org/10.1007/s11063-024-11539-7
