09.06.2023 | Manuscript

What Limits the Performance of Local Self-attention?

Authors: Jingkai Zhou, Pichao Wang, Jiasheng Tang, Fan Wang, Qiong Liu, Hao Li, Rong Jin

Published in: International Journal of Computer Vision | Issue 10/2023

Abstract

Although self-attention is powerful in modeling long-range dependencies, the performance of local self-attention (LSA) is merely comparable to that of depth-wise convolution, which leaves researchers puzzled about whether to use LSA or its counterparts, which of the two is better, and what limits the performance of LSA. To clarify these questions, we comprehensively investigate LSA and its counterparts from two perspectives: channel setting and spatial processing. We find that the devil lies in the generation and application of attention, where relative position embedding and neighboring filter application are the key factors. Based on these findings, we propose enhanced local self-attention (ELSA), built on Hadamard attention and the ghost head. Hadamard attention introduces the Hadamard product to efficiently generate attention over the neighboring area while maintaining a high-order mapping. The ghost head combines attention maps with static matrices to increase channel capacity. Experiments demonstrate the effectiveness of ELSA. Without any architecture or hyperparameter modification, drop-in replacing LSA with ELSA boosts Swin Transformer by up to +1.4 top-1 accuracy. ELSA also consistently benefits VOLO from D1 to D5, where ELSA-VOLO-D5 achieves 87.2 top-1 accuracy on ImageNet-1K without extra training images. We further evaluate ELSA on downstream tasks: it significantly improves the baseline by up to +1.9 box AP / +1.3 mask AP on COCO and by up to +1.9 mIoU on ADE20K.
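
To make the two components concrete, below is a minimal PyTorch sketch of the ideas named in the abstract. It is not the authors' implementation: the `ELSASketch` module, the `attn_proj` projection that maps the Hadamard product to per-neighbor logits, the `ghost_mul` factor, the `static` matrices, and the unfold-based neighborhood gathering are illustrative assumptions chosen only to show how Hadamard attention and a ghost head could fit together.

```python
# Minimal sketch of Hadamard attention + ghost head (not the authors' code).
# Assumptions for illustration: a 1x1 conv maps the Hadamard product q*k to
# k*k attention logits per head, and the ghost head multiplies each dynamic
# attention map by learned static matrices to spawn extra filters per position.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ELSASketch(nn.Module):
    def __init__(self, dim, num_heads=4, kernel_size=5, ghost_mul=2):
        super().__init__()
        assert dim % (num_heads * ghost_mul) == 0
        self.num_heads, self.kernel_size, self.ghost_mul = num_heads, kernel_size, ghost_mul
        self.q = nn.Conv2d(dim, dim, 1)
        self.k = nn.Conv2d(dim, dim, 1)
        self.v = nn.Conv2d(dim, dim, 1)
        # Hadamard attention: project q*k to one logit per neighbor and head.
        self.attn_proj = nn.Conv2d(dim, num_heads * kernel_size ** 2, 1)
        # Ghost head: static matrices combined with the dynamic attention maps.
        self.static = nn.Parameter(torch.randn(ghost_mul, num_heads, kernel_size ** 2) * 0.02)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        B, C, H, W = x.shape
        k2, g, h = self.kernel_size ** 2, self.ghost_mul, self.num_heads
        q, k, v = self.q(x), self.k(x), self.v(x)

        # Attention generation via the Hadamard product instead of a dot product.
        attn = self.attn_proj(q * k).view(B, h, k2, H * W).softmax(dim=2)

        # Ghost head: each of the h dynamic maps is modulated by g static matrices,
        # yielding g*h local filters per position (larger channel capacity).
        attn = (attn.unsqueeze(1) * self.static[None, :, :, :, None]).reshape(B, g * h, k2, H * W)

        # Apply the filters to the k x k neighborhood of v.
        v = F.unfold(v, self.kernel_size, padding=self.kernel_size // 2)  # B, C*k2, H*W
        v = v.view(B, g * h, C // (g * h), k2, H * W)
        out = (attn.unsqueeze(2) * v).sum(dim=3).reshape(B, C, H, W)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 64, 14, 14)
    print(ELSASketch(dim=64)(x).shape)  # torch.Size([2, 64, 14, 14])
```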

Metadata
Title
What Limits the Performance of Local Self-attention?
Authors
Jingkai Zhou
Pichao Wang
Jiasheng Tang
Fan Wang
Qiong Liu
Hao Li
Rong Jin
Publication date
09.06.2023
Publisher
Springer US
Published in
International Journal of Computer Vision / Issue 10/2023
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-023-01813-x
