
Visual attention network

Meng-Hao Guo 1, Cheng-Ze Lu 2, Zheng-Ning Liu 3, Ming-Ming Cheng 2, Shi-Min Hu 1 (corresponding author)
1 Department of Computer Science, Tsinghua University, Beijing, China
2 Nankai University, Tianjin, China
3 Fitten Tech, Beijing, China

Abstract

While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision: (1) treating images as 1D sequences neglects their 2D structures; (2) the quadratic complexity is too expensive for high-resolution images; (3) it only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel linear attention named large kernel attention (LKA) to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings. Furthermore, we present a neural network based on LKA, namely the Visual Attention Network (VAN). While extremely simple, VAN achieves results comparable to similarly sized convolutional neural networks (CNNs) and vision transformers (ViTs) in various tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, pose estimation, etc. For example, VAN-B6 achieves 87.8% accuracy on the ImageNet benchmark and sets new state-of-the-art performance (58.2% PQ) for panoptic segmentation. Besides, VAN-B2 surpasses Swin-T by 4% mIoU (50.1% vs. 46.1%) for semantic segmentation on the ADE20K benchmark and by 2.6% AP (48.8% vs. 46.2%) for object detection on the COCO dataset. It provides a novel method and a simple yet strong baseline for the community. The code is available at https://github.com/Visual-Attention-Network.

Keywords: attention, deep learning, vision backbone, ConvNets
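
For readers who want a concrete picture of the LKA idea sketched in the abstract, below is a minimal PyTorch module following the decomposition used in the released code: a depth-wise convolution, a depth-wise dilated convolution, and a 1x1 point-wise convolution, whose output re-weights the input feature map element-wise. The specific kernel sizes (5x5, 7x7 with dilation 3) and the placement of this module inside a VAN block are assumptions based on the public repository linked above and should be checked against it.

```python
import torch
import torch.nn as nn


class LKA(nn.Module):
    """Large kernel attention: approximate a large receptive field by
    decomposing a large-kernel convolution into a depth-wise convolution,
    a depth-wise dilated convolution, and a 1x1 point-wise convolution,
    then use the result as an attention map for element-wise re-weighting."""

    def __init__(self, dim: int):
        super().__init__()
        # 5x5 depth-wise convolution captures local structure.
        self.conv0 = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # 7x7 depth-wise convolution with dilation 3 captures long-range context
        # (effective kernel size 19) at linear cost in the number of pixels.
        self.conv_spatial = nn.Conv2d(dim, dim, kernel_size=7, padding=9,
                                      groups=dim, dilation=3)
        # 1x1 convolution mixes channels, providing channel adaptability.
        self.conv1 = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.conv0(x)
        attn = self.conv_spatial(attn)
        attn = self.conv1(attn)
        # Re-weight the input both spatially and per channel.
        return x * attn


# Toy usage: re-weight a batch of 64-channel feature maps.
if __name__ == "__main__":
    x = torch.randn(2, 64, 56, 56)
    print(LKA(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```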


Publication history

Received: 03 April 2023
Accepted: 28 June 2023
Published: 28 July 2023
Issue date: December 2023

Copyright

© The Author(s) 2023.

Acknowledgements

This paper was supported by the National Key R&D Program of China (Project No. 2021ZD0112902), the National Natural Science Foundation of China (Project No. 62220106003), and the Tsinghua–Tencent Joint Laboratory for Internet Innovation Technology.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
