
02.04.2024

EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm

Authors: Jiangning Zhang, Xiangtai Li, Yabiao Wang, Chengjie Wang, Yibo Yang, Yong Liu, Dacheng Tao

Published in: International Journal of Computer Vision

Abstract

Motivated by biological evolution, this paper explains the rationality of the Vision Transformer by analogy with the proven, practically effective evolutionary algorithm (EA) and derives that both share a consistent mathematical formulation. Inspired by effective EA variants, we then propose a novel pyramid EATFormer backbone that consists solely of the proposed EA-based transformer (EAT) block, which comprises three residual parts, i.e., multi-scale region aggregation, global and local interaction, and feed-forward network modules, to model multi-scale, interactive, and individual information separately. Moreover, we design a task-related head docked with the transformer backbone to complete the final information fusion more flexibly, and we improve a modulated deformable MSA to dynamically model irregular locations. Extensive quantitative and qualitative experiments on image classification, downstream tasks, and explanatory studies demonstrate the effectiveness and superiority of our approach over state-of-the-art methods. For example, our Mobile (1.8 M), Tiny (6.1 M), Small (24.3 M), and Base (49.0 M) models achieve 69.4, 78.4, 83.1, and 83.9 Top-1 accuracy when trained only on ImageNet-1K with a naive training recipe; Mask R-CNN equipped with EATFormer-Tiny/Small/Base obtains 45.4/47.4/49.0 box AP and 41.4/42.9/44.2 mask AP on COCO detection, surpassing the contemporary MPViT-T, Swin-T, and Swin-S by 0.6/1.4/0.5 box AP and 0.4/1.3/0.9 mask AP respectively with fewer FLOPs; our EATFormer-Small/Base achieve 47.3/49.3 mIoU on ADE20K with UperNet, exceeding Swin-T/S by 2.8/1.7. Code is available at https://github.com/zhangzjn/EATFormer.
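
The three-part residual layout of the EAT block lends itself to a compact sketch. The minimal PyTorch-style snippet below is an illustrative reading of the abstract only: the residual ordering (multi-scale region aggregation, then global and local interaction, then feed-forward network) follows the stated design, while the internals of the `msra` and interaction branches are placeholder assumptions rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn


class FeedForward(nn.Module):
    """Standard transformer FFN modelling per-token (individual) information."""

    def __init__(self, dim, hidden_ratio=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * hidden_ratio),
            nn.GELU(),
            nn.Linear(dim * hidden_ratio, dim),
        )

    def forward(self, x):
        return self.net(x)


class EATBlockSketch(nn.Module):
    """Hypothetical EAT block: x -> +MSRA -> +interaction -> +FFN, all residual."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Placeholder for multi-scale region aggregation (details not given in the abstract).
        self.msra = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        # Global branch only: plain multi-head self-attention; the paper also pairs
        # it with a local branch, which is omitted here for brevity.
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = FeedForward(dim)

    def forward(self, x):  # x: (batch, tokens, dim)
        x = x + self.msra(self.norm1(x))                            # multi-scale information
        y = self.norm2(x)
        x = x + self.global_attn(y, y, y, need_weights=False)[0]    # interactive information
        x = x + self.ffn(self.norm3(x))                             # individual information
        return x


if __name__ == "__main__":
    block = EATBlockSketch(dim=64)
    tokens = torch.randn(2, 196, 64)
    print(block(tokens).shape)  # torch.Size([2, 196, 64])
```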

Metadata
Title
EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm
Authors
Jiangning Zhang
Xiangtai Li
Yabiao Wang
Chengjie Wang
Yibo Yang
Yong Liu
Dacheng Tao
Publication date
02.04.2024
Publisher
Springer US
Published in
International Journal of Computer Vision
Print ISSN: 0920-5691
Electronic ISSN: 1573-1405
DOI
https://doi.org/10.1007/s11263-024-02034-6
