A Survey of Vision and Language Related Multi-Modal Task

Lanxiao Wang1, Wenzhe Hu1, Heqian Qiu1, Chao Shang1, Taijin Zhao1, Benliu Qiu1, King Ngi Ngan2, Hongliang Li1 (corresponding author)
1 Department of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 The Chinese University of Hong Kong, Hong Kong 999077, China

Abstract

With significant breakthroughs in single-modal deep learning research, an increasing number of works have begun to focus on multi-modal tasks. A multi-modal task usually involves more than one modality, where each modality represents a distinct type of behavior or state. Common modalities include vision, hearing, language, touch, and smell. Vision and language are two of the most common modalities in human daily life, and many typical multi-modal tasks, such as visual captioning and visual grounding, focus on these two. In this paper, we conduct an in-depth study of typical vision and language tasks from the perspectives of generation, analysis, and reasoning. First, we analyze and summarize the typical tasks and classical methods, organizing them by their different algorithmic concerns, and further discuss the frequently used datasets and evaluation metrics. Then, we briefly summarize several variant and cutting-edge tasks to build a more comprehensive framework of vision and language related multi-modal tasks. Finally, we discuss the development of pre-training related research and provide an outlook for future work. We hope this survey helps researchers understand the latest progress, open problems, and exploration directions of vision and language multi-modal tasks, and provides guidance for future research.

Keywords: deep learning, pre-training, vision and language, multi-modal generation, multi-modal analysis, multi-modal reasoning


Publication history

Received: 04 July 2022
Revised: 15 December 2022
Accepted: 27 December 2022
Published: 10 March 2023
Issue date: December 2022

Copyright

© The author(s) 2022

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 61831005).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
