
Attention mechanisms in computer vision: A survey

Meng-Hao Guo1, Tian-Xing Xu1, Jiang-Jiang Liu2, Zheng-Ning Liu1, Peng-Tao Jiang2, Tai-Jiang Mu1, Song-Hai Zhang1, Ralph R. Martin3, Ming-Ming Cheng2, Shi-Min Hu1 (corresponding author)
1 BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
2 TKLNDST, College of Computer Science, Nankai University, Tianjin 300350, China
3 School of Computer Science and Informatics, Cardiff University, Cardiff, UK

Abstract

Humans can naturally and effectively find salient regions in complex scenes. Motivated by this observation, attention mechanisms were introduced into computer vision with the aim of imitating this aspect of the human visual system. Such an attention mechanism can be regarded as a dynamic weight adjustment process based on features of the input image. Attention mechanisms have achieved great success in many visual tasks, including image classification, object detection, semantic segmentation, video understanding, image generation, 3D vision, multi-modal tasks, and self-supervised learning. In this survey, we provide a comprehensive review of various attention mechanisms in computer vision and categorize them according to approach, such as channel attention, spatial attention, temporal attention, and branch attention; a related repository https://github.com/MenghaoGuo/Awesome-Vision-Attentions is dedicated to collecting related work. We also suggest future directions for attention mechanism research.
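
To make "dynamic weight adjustment based on features of the input" concrete, the sketch below shows a minimal channel-attention block in the squeeze-and-excitation style, written in PyTorch. It is an illustrative simplification, not the reference implementation of any surveyed method; the class name ChannelAttention and the reduction ratio of 16 are our own assumptions.

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Minimal SE-style channel attention: each channel is re-scaled by a
    weight computed from the input's own global statistics."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # squeeze: global average over H x W
        self.fc = nn.Sequential(                 # excitation: per-channel gates in [0, 1]
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)              # (B, C) channel descriptors
        w = self.fc(w).view(b, c, 1, 1)          # (B, C, 1, 1) dynamic weights
        return x * w                             # re-weight the input feature map

# Usage: attach after any convolutional stage; output shape equals input shape.
feat = torch.randn(2, 64, 32, 32)
out = ChannelAttention(64)(feat)

Spatial, temporal, and branch attention follow the same recipe, but the learned weights are applied over spatial positions, video frames, or parallel branches rather than channels.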

Keywords: attention, computer vision, deep learning, transformer, salience


Publication history

Received: 31 December 2021
Accepted: 18 January 2022
Published: 15 March 2022
Issue date: September 2022

Copyright

© The Author(s) 2022.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61521002 and 62132012). We would like to thank Cheng-Ze Lu, Zhengyang Geng, Shilong Liu, He Wang, Huiying Lu, and Chenxi Huang for their helpful discussions and insightful suggestions.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
