References
[1]
Itti, L.; Koch, C.; Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 20, No. 11, 1254-1259, 1998.
[2]
Hayhoe, M.; Ballard, D. Eye movements in natural behavior. Trends in Cognitive Sciences Vol. 9, No. 4, 188-194, 2005.
[3]
Rensink, R. A. The dynamic representation of scenes. Visual Cognition Vol. 7, Nos. 1-3, 17-42, 2000.
[4]
Corbetta, M.; Shulman, G. L. Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience Vol. 3, No. 3, 201-215, 2002.
[5]
Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. H. Squeeze-and-excitation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 42, No. 8, 2011-2023, 2020.
[6]
Woo, S.; Park, J.; Lee, J.; Kweon, I. S. CBAM: Convolutional block attention module. In: Computer Vision - ECCV 2018. Lecture Notes in Computer Science, Vol. 11211. Ferrari, V.; Hebert, M.; Sminchisescu, C.; Weiss, Y. Eds. Springer Cham, 3-19, 2018.
[7]
Dai, J. F.; Qi, H. Z.; Xiong, Y. W.; Li, Y.; Zhang, G. D.; Hu, H.; Wei, Y. Deformable convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision, 764-773, 2017.
[8]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In: Computer Vision - ECCV 2020. Lecture Notes in Computer Science, Vol. 12346. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 213-229, 2020.
[9]
Yuan, Y.; Wang, J. OCNet: Object context network for scene parsing. arXiv preprint arXiv:1809.00916, 2018.
[10]
Fu, J.; Liu, J.; Tian, H. J.; Li, Y.; Bao, Y. J.; Fang, Z. W.; Lu, H. Dual attention network for scene segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3141-3149, 2019.
[11]
Yang, J. L.; Ren, P. R.; Zhang, D. Q.; Chen, D.; Wen, F.; Li, H. D.; Hua, G. Neural aggregation network for video face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5216-5225, 2017.
[12]
Wang, Q. C.; Wu, T. Y.; Zheng, H.; Guo, G. D. Hierarchical pyramid diverse attention networks for face recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8323-8332, 2020.
[13]
Li, W.; Zhu, X. T.; Gong, S. G. Harmonious attention network for person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2285-2294, 2018.
[14]
Chen, B. H.; Deng, W. H.; Hu, J. N. Mixed high-order attention network for person re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 371-381, 2019.
[15]
Wang, X. L.; Girshick, R.; Gupta, A.; He, K. M. Non-local neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7794-7803, 2018.
[16]
Du, W. B.; Wang, Y. L.; Qiao, Y. Recurrent spatial-temporal attention network for action recognition in videos. IEEE Transactions on Image Processing Vol. 27, No. 3, 1347-1360, 2018.
[17]
Peng, Y. X.; He, X. T.; Zhao, J. J. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing Vol. 27, No. 3, 1487-1500, 2018.
[18]
He, P.; Huang, W. L.; He, T.; Zhu, Q. L.; Qiao, Y.; Li, X. L. Single shot text detector with regional attention. In: Proceedings of the IEEE International Conference on Computer Vision, 3066-3074, 2017.
[19]
Oktay, O.; Schlemper, J.; Folgoc, L. L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N. Y.; Kainz, B.; et al. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018.
[20]
Guan, Q.; Huang, Y.; Zhong, Z.; Zheng, Z.; Zheng, L.; Yang, Y. Diagnose like a radiologist: Attention guided convolutional neural network for thorax disease classification. arXiv preprint arXiv:1801.09927, 2018.
[21]
Gregor, K.; Danihelka, I.; Graves, A.; Wierstra, D. DRAW: A recurrent neural network for image generation. In: Proceedings of the 32nd International Conference on Machine Learning, 1462-1471, 2015.
[22]
Zhang, H.; Goodfellow, I. J.; Metaxas, D. N.; Odena, A. Self-attention generative adversarial networks. In: Proceedings of the 36th International Conference on Machine Learning, 7354-7363, 2019.
[23]
Chu, X.; Yang, W.; Ouyang, W. L.; Ma, C.; Yuille, A. L.; Wang, X. G. Multi-context attention for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5669-5678, 2017.
[24]
Dai, T.; Cai, J. R.; Zhang, Y. B.; Xia, S. T.; Zhang, L. Second-order attention network for single image super-resolution. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11057-11066, 2019.
[25]
Zhang, Y. L.; Li, K. P.; Li, K.; Wang, L. C.; Zhong, B. N.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In: Computer Vision - ECCV 2018. Lecture Notes in Computer Science, Vol. 11211. Ferrari, V.; Hebert, M.; Sminchisescu, C.; Weiss, Y. Eds. Springer Cham, 294-310, 2018.
[26]
Xie, S. N.; Liu, S. N.; Chen, Z. Y.; Tu, Z. W. Attentional ShapeContextNet for point cloud recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4606-4615, 2018.
[27]
Guo, M. H.; Cai, J. X.; Liu, Z. N.; Mu, T. J.; Martin, R. R.; Hu, S. M. PCT: Point cloud transformer. Computational Visual Media Vol. 7, No. 2, 187-199, 2021.
[28]
Su, W. J.; Zhu, X. Z.; Cao, Y.; Li, B.; Lu, L. W.; Wei, F. R.; Dai, J. F. VL-BERT: Pre-training of generic visual-linguistic representations. In: Proceedings of the International Conference on Learning Representations, 2020.
[29]
Xu, T.; Zhang, P. C.; Huang, Q. Y.; Zhang, H.; Gan, Z.; Huang, X. L.; He, X. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1316-1324, 2018.
[30]
Wu, Y. X.; He, K. M. Group normalization. International Journal of Computer Vision Vol. 128, No. 3, 742-755, 2020.
[31]
Mnih, V.; Heess, N.; Graves, A.; Kavukcuoglu, K. Recurrent models of visual attention. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, Vol. 2, 2204-2212, 2014.
[32]
Jaderberg, M.; Simonyan, K.; Zisserman, A.; Kavukcuoglu, K. Spatial transformer networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, Vol. 2, 2017-2025, 2015.
[33]
Vaswani, A.; Shazeer, N. M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, 6000-6010, 2017.
[34]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16×16 words: Transformers for image recognition at scale. In: Proceedings of the 9th International Conference on Learning Representations, 2021.
[35]
Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning, 2048-2057, 2015.
[36]
Zhu, X. Z.; Hu, H.; Lin, S.; Dai, J. F. Deformable ConvNets V2: More deformable, better results. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9300-9308, 2019.
[37]
Wang, Q. L.; Wu, B. G.; Zhu, P. F.; Li, P. H.; Zuo, W. M.; Hu, Q. H. ECA-Net: Efficient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11531-11539, 2020.
[38]
Devlin, J.; Chang, M. W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[39]
Yang, Z. L.; Dai, Z. H.; Yang, Y. M.; Carbonell, J. G.; Salakhutdinov, R.; Le, Q. V. XLNet: Generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd Conference on Neural Information Processing Systems, 2019.
[40]
Li, X.; Zhong, Z. S.; Wu, J. L.; Yang, Y. B.; Lin, Z. C.; Liu, H. Expectation-maximization attention networks for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 9166-9175, 2019.
[41]
Huang, Z. L.; Wang, X. G.; Huang, L. C.; Huang, C.; Wei, Y. C.; Liu, W. Y. CCNet: Criss-cross attention for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
[42]
Geng, Z.; Guo, M.-H.; Chen, H.; Li, X.; Wei, K.; Lin, Z. Is attention better than matrix decomposition? In: Proceedings of the International Conference on Learning Representations, 2021.
[43]
Ramachandran, P.; Parmar, N.; Vaswani, A.; Bello, I.; Levskaya, A.; Shlens, J. Stand-alone self-attention in vision models. In: Proceedings of the 33rd Conference on Neural Information Processing Systems, 2019.
[44]
Yuan, L.; Chen, Y.; Wang, T.; Yu, W.; Shi, Y.; Jiang, Z.-H.; Tay, F. E.; Feng, J.; Yan, S. Tokens-to-Token ViT: Training vision transformers from scratch on ImageNet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 558-567, 2021.
[45]
Wang, W. H.; Xie, E. Z.; Li, X.; Fan, D. P.; Song, K. T.; Liang, D.; Lu, T.; Luo, P.; Shao, L. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 568-578, 2021.
[46]
Liu, Z.; Lin, Y. T.; Cao, Y.; Hu, H.; Wei, Y. X.; Zhang, Z.; Lin, S.; Guo, B. N. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012-10022, 2021.
[47]
Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing convolutions to vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 22-31, 2021.
[48]
Yuan, L.; Hou, Q. B.; Jiang, Z. H.; Feng, J. S.; Yan, S. C. VOLO: Vision outlooker for visual recognition. arXiv preprint arXiv:2106.13112, 2021.
[49]
Dai, Z. H.; Liu, H. X.; Le, Q. V.; Tan, M. X. CoAtNet: Marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803, 2021.
[50]
Chen, L.; Zhang, H. W.; Xiao, J.; Nie, L. Q.; Shao, J.; Liu, W.; Chua, T. SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6298-6306, 2017.
[51]
Nair, V.; Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning, 807-814, 2010.
[52]
Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, 448-456, 2015.
[53]
Zhang, H.; Dana, K.; Shi, J. P.; Zhang, Z. Y.; Wang, X. G.; Tyagi, A.; Agrawal, A. Context encoding for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7151-7160, 2018.
[54]
Gao, Z. L.; Xie, J. T.; Wang, Q. L.; Li, P. H. Global second-order pooling convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3019-3028, 2019.
[55]
Lee, H.; Kim, H. E.; Nam, H. SRM: A style-based recalibration module for convolutional neural networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 1854-1862, 2019.
[56]
Yang, Z. X.; Zhu, L. C.; Wu, Y.; Yang, Y. Gated channel transformation for visual recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11791-11800, 2020.
[57]
Qin, Z. Q.; Zhang, P. Y.; Wu, F.; Li, X. FcaNet: Frequency channel attention networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 783-792, 2021.
[58]
Diba, A. L.; Fayyaz, M.; Sharma, V.; Arzani, M. M.; Yousefzadeh, R.; Gall, J.; van Gool, L. Spatio-temporal channel correlation networks for action classification. In: Computer Vision - ECCV 2018. Lecture Notes in Computer Science, Vol. 11208. Ferrari, V.; Hebert, M.; Sminchisescu, C.; Weiss, Y. Eds. Springer Cham, 299-315, 2018.
[59]
Chen, Z. R.; Li, Y.; Bengio, S.; Si, S. You look twice: GaterNet for dynamic filter selection in CNNs. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9164-9172, 2019.
[60]
Shi, H. Y.; Lin, G. S.; Wang, H.; Hung, T. Y.; Wang, Z. H. SpSequenceNet: Semantic segmentation network on 4D point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4573-4582, 2020.
[61]
Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Vedaldi, A. Gather-excite: Exploiting feature context in convolutional neural networks. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 9423-9433, 2018.
[62]
Yan, X.; Zheng, C. D.; Li, Z.; Wang, S.; Cui, S. G. PointASNL: Robust point clouds processing using nonlocal neural networks with adaptive sampling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5588-5597, 2020.
[63]
Hu, H.; Gu, J. Y.; Zhang, Z.; Dai, J. F.; Wei, Y. C. Relation networks for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3588-3597, 2018.
[64]
Zhang, H.; Zhang, H.; Wang, C. G.; Xie, J. Y. Co-occurrent features in semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 548-557, 2019.
[65]
Bello, I.; Zoph, B.; Le, Q.; Vaswani, A.; Shlens, J. Attention augmented convolutional networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 3285-3294, 2019.
[66]
Zhu, X. Z.; Cheng, D. Z.; Zhang, Z.; Lin, S.; Dai, J. F. An empirical study of spatial attention mechanisms in deep networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 6687-6696, 2019.
[67]
Li, X.; Yang, Y. B.; Zhao, Q. J.; Shen, T. C.; Lin, Z. C.; Liu, H. Spatial pyramid based graph reasoning for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8947-8956, 2020.
[68]
Zhu, Z.; Xu, M. D.; Bai, S.; Huang, T. T.; Bai, X. Asymmetric non-local neural networks for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 593-602, 2019.
[69]
Cao, Y.; Xu, J. R.; Lin, S.; Wei, F. Y.; Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop, 1971-1980, 2019.
[70]
Chen, Y.; Kalantidis, Y.; Li, J.; Yan, S.; Feng, J. A2-nets: Double attention networks. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 350-359, 2018.
[71]
Chen, Y. P.; Rohrbach, M.; Yan, Z. C.; Yan, S. C.; Feng, J. S.; Kalantidis, Y. Graph-based global reasoning networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 433-442, 2019.
[72]
Zhang, S. Y.; Yan, S. P.; He, X. M. LatentGNN: Learning efficient non-local relations for visual recognition. In: Proceedings of the 36th International Conference on Machine Learning, 7374-7383, 2019.
[73]
Yuan, Y.; Chen, X.; Chen, X.; Wang, J. Segmentation transformer: Object-contextual representations for semantic segmentation. arXiv preprint arXiv:1909.11065, 2019.
[74]
Yin, M. H.; Yao, Z. L.; Cao, Y.; Li, X.; Zhang, Z.; Lin, S.; Hu, H. Disentangled non-local neural networks. In: Computer Vision - ECCV 2020. Lecture Notes in Computer Science, Vol. 12360. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 191-207, 2020.
[75]
Guo, M. H.; Liu, Z. N.; Mu, T. J.; Hu, S. M. Beyond self-attention: External attention using two linear layers for visual tasks. arXiv preprint arXiv:2105.02358, 2021.
[76]
Hu, H.; Zhang, Z.; Xie, Z. D.; Lin, S. Local relation networks for image recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 3463-3472, 2019.
[77]
Zhao, H. S.; Jia, J. Y.; Koltun, V. Exploring self-attention for image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10073-10082, 2020.
[78]
Chen, M.; Radford, A.; Child, R.; Wu, J.; Jun, H.; Luan, D.; Sutskever, I. Generative pretraining from pixels. In: Proceedings of the 37th International Conference on Machine Learning, 1691-1703, 2020.
[79]
Chen, H. T.; Wang, Y. H.; Guo, T. Y.; Xu, C.; Deng, Y. P.; Liu, Z. H.; Ma, S.; Xu, C.; Xu, C.; Gao, W. Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 12294-12305, 2021.
[80]
Zhao, H.; Jiang, L.; Jia, J.; Torr, P.; Koltun, V. Point transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 16259-16268, 2021.
[81]
Zheng, S. X.; Lu, J. C.; Zhao, H. S.; Zhu, X. T.; Luo, Z. K.; Wang, Y. B.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P. H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6877-6886, 2021.
[82]
Han, K.; Xiao, A.; Wu, E.; Guo, J.; Xu, C.; Wang, Y. Transformer in transformer. arXiv preprint arXiv:2103.00112, 2021.
[83]
Liu, S. L.; Zhang, L.; Yang, X.; Su, H.; Zhu, J. Query2Label: A simple transformer way to multi-label classification. arXiv preprint arXiv:2107.10834, 2021.
[84]
Chen, X. L.; Xie, S. N.; He, K. M. An empirical study of training self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 9640-9649, 2021.
[85]
Bao, H. B.; Dong, L.; Wei, F. R. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
[86]
Xie, E. Z.; Wang, W. H.; Yu, Z. D.; Anandkumar, A.; Álvarez, J.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203, 2021.
[87]
Zhao, H.; Zhang, Y.; Liu, S.; Shi, J.; Loy, C. C.; Lin, D.; Jia, J. PSANet: Point-wise spatial attention network for scene parsing. In: Computer Vision - ECCV 2018. Lecture Notes in Computer Science, Vol. 11213. Ferrari, V.; Hebert, M.; Sminchisescu, C.; Weiss, Y. Eds. Springer Cham, 270-286, 2018.
[88]
Ba, J.; Mnih, V.; Kavukcuoglu, K. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.
[89]
Sharma, S.; Kiros, R.; Salakhutdinov, R. Action recognition using visual attention. arXiv preprint arXiv:1511.04119, 2015.
[90]
Girdhar, R.; Ramanan, D. Attentional pooling for action recognition. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, 33-44, 2017.
[91]
Li, Z. Y.; Gavrilyuk, K.; Gavves, E.; Jain, M.; Snoek, C. G. M. VideoLSTM convolves, attends and flows for action recognition. Computer Vision and Image Understanding Vol. 166, 41-50, 2018.
[92]
Yue, K. Y.; Sun, M.; Yuan, Y. C.; Zhou, F.; Ding, E. R.; Xu, F. X. Compact generalized non-local network. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 6511-6520, 2018.
[93]
Liu, X. H.; Han, Z. Z.; Wen, X.; Liu, Y. S.; Zwicker, M. L2G auto-encoder: Understanding point clouds by local-to-global reconstruction with hierarchical self-attention. In: Proceedings of the 27th ACM International Conference on Multimedia, 989-997, 2019.
[94]
Paigwar, A.; Erkent, O.; Wolf, C.; Laugier, C. Attentional PointNet for 3D-object detection in point clouds. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 1297-1306, 2019.
[95]
Wen, X.; Han, Z. Z.; Youk, G.; Liu, Y. S. CF-SIS: Semantic-instance segmentation of 3D point clouds by context fusion with self-attention. In: Proceedings of the 28th ACM International Conference on Multimedia, 1661-1669, 2020.
[96]
Yang, J. C.; Zhang, Q.; Ni, B. B.; Li, L. G.; Liu, J. X.; Zhou, M. D.; Tian, Q. Modeling point clouds with self-attention and Gumbel subset sampling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3318-3327, 2019.
[97]
Xu, J.; Zhao, R.; Zhu, F.; Wang, H. M.; Ouyang, W. L. Attention-aware compositional network for person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2119-2128, 2018.
[98]
Liu, H.; Feng, J. S.; Qi, M. B.; Jiang, J. G.; Yan, S. C. End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing Vol. 26, No. 7, 3492-3506, 2017.
[99]
Zheng, Z. D.; Zheng, L.; Yang, Y. Pedestrian alignment network for large-scale person re-identification. IEEE Transactions on Circuits and Systems for Video Technology Vol. 29, No. 10, 3037-3045, 2019.
[100]
Li, K. P.; Wu, Z. Y.; Peng, K. C.; Ernst, J.; Fu, Y. Tell me where to look: Guided attention inference network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9215-9223, 2018.
[101]
Zhang, Z. Z.; Lan, C. L.; Zeng, W. J.; Jin, X.; Chen, Z. B. Relation-aware global attention for person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3183-3192, 2020.
[102]
Zhao, B.; Wu, X.; Feng, J. S.; Peng, Q.; Yan, S. C. Diversified visual attention networks for fine-grained object classification. IEEE Transactions on Multimedia Vol. 19, No. 6, 1245-1256, 2017.
[103]
Bryan, B.; Gong, Y.; Zhang, Y. Z.; Poellabauer, C. Second-order non-local attention networks for person re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 3759-3768, 2019.
[104]
Zheng, H. L.; Fu, J. L.; Mei, T.; Luo, J. B. Learning multi-attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE International Conference on Computer Vision, 5219-5227, 2017.
[105]
Fu, J. L.; Zheng, H. L.; Mei, T. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4476-4484, 2017.
[106]
Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. arXiv preprint arXiv:2201.12329, 2022.
[107]
Yang, G. Y.; Li, X. L.; Martin, R. R.; Hu, S. M. Sampling equivariant self-attention networks for object detection in aerial images. arXiv preprint arXiv:2111.03420, 2021.
[108]
Zheng, H. L.; Fu, J. L.; Zha, Z. J.; Luo, J. B. Looking for the devil in the details: Learning trilinear attention sampling network for fine-grained image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5007-5016, 2019.
[109]
Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A. R.; Choi, S.; Teh, Y. W. Set transformer: A framework for attention-based permutation-invariant neural networks. In: Proceedings of the 36th International Conference on Machine Learning, 3744-3753, 2019.
[110]
Xu, S. J.; Cheng, Y.; Gu, K.; Yang, Y.; Chang, S. Y.; Zhou, P. Jointly attentive spatial-temporal pooling networks for video-based person re-identification. In: Proceedings of the IEEE International Conference on Computer Vision, 4743-4752, 2017.
[111]
Zhang, R. M.; Li, J. Y.; Sun, H. B.; Ge, Y. Y.; Luo, P.; Wang, X. G.; Lin, L. SCAN: Self-and-collaborative attention network for video person re-identification. IEEE Transactions on Image Processing Vol. 28, No. 10, 4870-4882, 2019.
[112]
Chen, D. P.; Li, H. S.; Xiao, T.; Yi, S.; Wang, X. G. Video person re-identification with competitive snippet-similarity aggregation and co-attentive snippet embedding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1169-1178, 2018.
[113]
Srivastava, R. K.; Greff, K.; Schmidhuber, J. Training very deep networks. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, Vol. 2, 2377-2385, 2015.
[114]
Li, X.; Wang, W. H.; Hu, X. L.; Yang, J. Selective kernel networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 510-519, 2019.
[115]
Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Lin, H.; Zhang, Z.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. ResNeSt: Split-attention networks. arXiv preprint arXiv:2004.08955, 2020.
[116]
Chen, Y. P.; Dai, X. Y.; Liu, M. C.; Chen, D. D.; Yuan, L.; Liu, Z. C. Dynamic convolution: Attention over convolution kernels. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11027-11036, 2020.
[117]
Park, J.; Woo, S.; Lee, J.-Y.; Kweon, I. S. BAM: Bottleneck attention module. arXiv preprint arXiv:1807.06514, 2018.
[118]
Yang, L.; Zhang, R.-Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In: Proceedings of the 38th International Conference on Machine Learning, 11863-11874, 2021.
[119]
Wang, F.; Jiang, M. Q.; Qian, C.; Yang, S.; Li, C.; Zhang, H. G.; Wang, X.; Tang, X. Residual attention network for image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6450-6458, 2017.
[120]
Guo, M.-H.; Lu, C.-Z.; Liu, Z.-N.; Cheng, M.-M.; Hu, S.-M. Visual attention network. arXiv preprint arXiv:2202.09741, 2022.
[121]
Liu, J. J.; Hou, Q. B.; Cheng, M. M.; Wang, C. H.; Feng, J. S. Improving convolutional networks with self-calibrated convolutions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10093-10102, 2020.
[122]
Misra, D.; Nalamada, T.; Arasanipalai, A. U.; Hou, Q. B. Rotate to attend: Convolutional triplet attention module. In: Proceedings of the IEEE Winter Conference on Applications of Computer Vision, 3138-3147, 2021.
[123]
Linsley, D.; Shiebler, D.; Eberhardt, S.; Serre, T. Learning what and where to attend. In: Proceedings of the 7th International Conference on Learning Representations, 2019.
[124]
Roy, A. G.; Navab, N.; Wachinger, C. Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation” blocks. IEEE Transactions on Medical Imaging Vol. 38, No. 2, 540-549, 2019.
[125]
Hou, Q. B.; Zhang, L.; Cheng, M. M.; Feng, J. S. Strip pooling: Rethinking spatial pooling for scene parsing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4002-4011, 2020.
[126]
You, H. X.; Feng, Y. F.; Ji, R. R.; Gao, Y. PVNet: A joint convolutional network of point cloud and multi-view for 3D shape recognition. In: Proceedings of the 26th ACM International Conference on Multimedia, 1310-1318, 2018.
[127]
Xie, Q.; Lai, Y. K.; Wu, J.; Wang, Z. T.; Zhang, Y. M.; Xu, K.; Wang, J. MLCVNet: Multi-level context VoteNet for 3D object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10444-10453, 2020.
[128]
Wang, C.; Zhang, Q.; Huang, C.; Liu, W.; Wang, X. Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In: Computer Vision - ECCV 2018. Lecture Notes in Computer Science, Vol. 11208. Ferrari, V.; Hebert, M.; Sminchisescu, C.; Weiss, Y. Eds. Springer Cham, 384-400, 2018.
[129]
Chen, T. L.; Ding, S. J.; Xie, J. Y.; Yuan, Y.; Chen, W. Y.; Yang, Y.; Ren, Z.; Wang, Z. ABD-net: Attentive but diverse person re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 8350-8360, 2019.
[130]
Hou, Q. B.; Zhou, D. Q.; Feng, J. S. Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13708-13717, 2021.
[131]
Song, S.; Lan, C.; Xing, J.; Zeng, W.; Liu, J. An end-to-end spatio-temporal attention model for human action recognition from skeleton data. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence, 4263-4270, 2017.
[132]
Fu, Y.; Wang, X. Y.; Wei, Y. C.; Huang, T. STA: Spatial-temporal attention for large-scale video-based person re-identification. Proceedings of the AAAI Conference on Artificial Intelligence Vol. 33, 8287-8294, 2019.
[133]
Gao, L. L.; Li, X. P.; Song, J. K.; Shen, H. T. Hierarchical LSTMs with adaptive attention for visual captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 42, No. 5, 1112-1131, 2020.
[134]
Yan, C. G.; Tu, Y. B.; Wang, X. Z.; Zhang, Y. B.; Hao, X. H.; Zhang, Y. D.; Dai, Q. STAT: Spatial-temporal attention mechanism for video captioning. IEEE Transactions on Multimedia Vol. 22, No. 1, 229-241, 2020.
[135]
Meng, L. L.; Zhao, B.; Chang, B.; Huang, G.; Sun, W.; Tung, F.; Sigal, L. Interpretable spatio-temporal attention for video action recognition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision Workshop, 1513-1522, 2019.
[136]
He, B.; Yang, X. T.; Wu, Z. X.; Chen, H.; Shrivastava, A. GTA: Global temporal attention for video action understanding. arXiv preprint arXiv:2012.08510, 2020.
[137]
Li, S.; Bak, S.; Carr, P.; Wang, X. G. Diversity regularized spatiotemporal attention for video-based person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 369-378, 2018.
[138]
Zhang, Z. Z.; Lan, C. L.; Zeng, W. J.; Chen, Z. B. Multi-granularity reference-aided attentive feature aggregation for video-based person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10404-10413, 2020.
[139]
Shim, M.; Ho, H. I.; Kim, J.; Wee, D. READ: Reciprocal attention discriminator for image-to-video re-identification. In: Computer Vision - ECCV 2020. Lecture Notes in Computer Science, Vol. 12359. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 335-350, 2020.
[140]
Liu, R.; Deng, H. M.; Huang, Y. Y.; Shi, X. Y.; Li, H. S. Decoupled spatial-temporal transformer for video inpainting. arXiv preprint arXiv:2104.06637, 2021.
[141]
Chaudhari, S.; Mithal, V.; Polatkan, G.; Ramanath, R. An attentive survey of attention models. ACM Transactions on Intelligent Systems and Technology Vol. 12, No. 5, Article No. 53, 2021.
[142]
Xu, Y. F.; Wei, H. P.; Lin, M. X.; Deng, Y. Y.; Sheng, K. K.; Zhang, M. D.; Tang, F.; Dong, W.; Huang, F.; Xu, C. Transformers in computational visual media: A survey. Computational Visual Media Vol. 8, No. 1, 33-62, 2022.
[143]
Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A survey on visual transformer. arXiv preprint arXiv:2012.12556, 2020.
[144]
Khan, S.; Naseer, M.; Hayat, M.; Zamir, S. W.; Khan, F. S.; Shah, M. Transformers in vision: A survey. ACM Computing Surveys, 2022.
[145]
Wang, F.; Tax, D. M. J. Survey on the attention based RNN model and its applications in computer vision. arXiv preprint arXiv:1601.06823, 2016.
[146]
He, K. M.; Zhang, X. Y.; Ren, S. Q.; Sun, J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778, 2016.
[147]
Fang, P. F.; Zhou, J. M.; Roy, S.; Petersson, L.; Harandi, M. Bilinear attention networks for person retrieval. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 8029-8038, 2019.
[148]
Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Computation Vol. 9, No. 8, 1735-1780, 1997.
[149]
Sutton, R. S.; McAllester, D. A.; Singh, S. P.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In: Proceedings of the 12th International Conference on Neural Information Processing Systems, 1057-1063, 1999.
[150]
Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
[151]
Lin, Z. H.; Feng, M. W.; Santos, C. N. D.; Yu, M.; Bengio, Y. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
[152]
Dai, Z. H.; Yang, Z. L.; Yang, Y. M.; Carbonell, J.; Le, Q.; Salakhutdinov, R. Transformer-XL: Attentive language models beyond a fixed-length context. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2978-2988, 2019.
[153]
Choromanski, K.; Likhosherstov, V.; Dohan, D.; Song, X. Y.; Gane, A.; Sarlos, T.; Hawkins, P.; Davis, J.; Mohiuddin, A.; Kaiser, L.; et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
[154]
Zhu, X. Z.; Su, W. J.; Lu, L. W.; Li, B.; Wang, X. G.; Dai, J. F. Deformable DETR: Deformable transformers for end-to-end object detection. In: Proceedings of the International Conference on Learning Representations, 2021.
[155]
Liu, W.; Rabinovich, A.; Berg, A. C. ParseNet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
[156]
Peng, C.; Zhang, X. Y.; Yu, G.; Luo, G. M.; Sun, J. Large kernel matters—Improve semantic segmentation by global convolutional network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1743-1751, 2017.
[157]
Zhao, H. S.; Shi, J. P.; Qi, X. J.; Wang, X. G.; Jia, J. Y. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 6230-6239, 2017.
[158]
He, K. M.; Zhang, X. Y.; Ren, S. Q.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. In: Computer Vision - ECCV 2014. Lecture Notes in Computer Science, Vol. 8691. Fleet, D.; Pajdla, T.; Schiele, B.; Tuytelaars, T. Eds. Springer Cham, 346-361, 2014.
[159]
Tolstikhin, I.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X. H.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. MLP-mixer: An all-MLP architecture for vision. In: Proceedings of the 35th Conference on Neural Information Processing Systems, 2021.
[160]
Touvron, H.; Bojanowski, P.; Caron, M.; Cord, M.; El-Nouby, A.; Grave, E.; Izacard, G.; Joulin, A.; Synnaeve, G.; Verbeek, J.; et al. ResMLP: Feedforward networks for image classification with data-efficient training. arXiv preprint arXiv:2105.03404, 2021.
[161]
Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
[162]
Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. In: Proceedings of the 34th Conference on Neural Information Processing Systems, 2020.
[163]
Ba, J. L.; Kiros, J. R.; Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[164]
Hendrycks, D.; Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[165]
Sun, C.; Shrivastava, A.; Singh, S.; Gupta, A. Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE International Conference on Computer Vision, 843-852, 2017.
[166]
Deng, J.; Dong, W.; Socher, R.; Li, L. J.; Kai, L.; Li, F. F. ImageNet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 248-255, 2009.
[167]
Zhou, D. Q.; Kang, B. Y.; Jin, X. J.; Yang, L. J.; Lian, X. C.; Jiang, Z. H.; Hou, Q. B.; Feng, J. S. DeepViT: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886, 2021.
[168]
Touvron, H.; Cord, M.; Sablayrolles, A.; Synnaeve, G.; Jégou, H. Going deeper with image transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 32-42, 2021.
[169]
Liu, R.; Deng, H. M.; Huang, Y. Y.; Shi, X. Y.; Lu, L. W.; Sun, W. X.; Wang, X.; Dai, J.; Li, H. FuseFormer: Fusing fine-grained information in transformers for video inpainting. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 14040-14049, 2021.
[170]
He, K. M.; Chen, X. L.; Xie, S. N.; Li, Y. H.; Dollár, P.; Girshick, R. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
[171]
Guo, M. H.; Liu, Z. N.; Mu, T. J.; Liang, D.; Martin, R. R.; Hu, S. M. Can attention enable MLPs to catch up with CNNs? Computational Visual Media Vol. 7, No. 3, 283-288, 2021.
[172]
Li, J. N.; Zhang, S. L.; Wang, J. D.; Gao, W.; Tian, Q. Global-local temporal representations for video person re-identification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 3957-3966, 2019.
[173]
Liu, Z. Y.; Wang, L. M.; Wu, W.; Qian, C.; Lu, T. TAM: Temporal adaptive module for video recognition. arXiv preprint arXiv:2005.06803, 2020.
[174]
Yang, B.; Bender, G.; Le, Q. V.; Ngiam, J. CondConv: Conditionally parameterized convolutions for efficient inference. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, Article No. 117, 1307-1318, 2019.
[175]
Spillmann, L.; Dresp-Langley, B.; Tseng, C. H. Beyond the classical receptive field: The effect of contextual stimuli. Journal of Vision Vol. 15, No. 9, 7, 2015.
[176]
Xie, S. N.; Girshick, R.; Dollár, P.; Tu, Z. W.; He, K. M. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5987-5995, 2017.
[177]
Webb, B. S.; Dhruv, N. T.; Solomon, S. G.; Tailby, C.; Lennie, P. Early and late mechanisms of surround suppression in striate cortex of macaque. Journal of Neuroscience Vol. 25, No. 50, 11666-11675, 2005.
[178]
Yang, J. R.; Zheng, W. S.; Yang, Q. Z.; Chen, Y. C.; Tian, Q. Spatial-temporal graph convolutional network for video-based person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3286-3296, 2020.
[179]
Szegedy, C.; Liu, W.; Jia, Y. Q.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-9, 2015.
[180]
Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 9650-9660, 2021.
[181]
Qian, N. On the momentum term in gradient descent learning algorithms. Neural Networks Vol. 12, No. 1, 145-151, 1999.
[182]
Kingma, D. P.; Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[183]
Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[184]
Chen, X. N.; Hsieh, C. J.; Gong, B. Q. When vision transformers outperform ResNets without pretraining or strong data augmentations. arXiv preprint arXiv:2106.01548, 2021.
[185]
Foret, P.; Kleiner, A.; Mobahi, H.; Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.