A Survey of Vision and Language Related Multi-Modal Task

Lanxiao Wang1, Wenzhe Hu1, Heqian Qiu1, Chao Shang1, Taijin Zhao1, Benliu Qiu1, King Ngi Ngan2, Hongliang Li1 (corresponding author)
1 Department of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 The Chinese University of Hong Kong, Hong Kong 999077, China

Abstract

With significant breakthroughs in single-modal deep learning research, an increasing number of works have begun to focus on multi-modal tasks. A multi-modal task usually involves more than one modality, where each modality represents a distinct type of behavior or state. Common modalities include vision, hearing, language, touch, and smell. Vision and language are two of the most common modalities in human daily life, and many typical multi-modal tasks, such as visual captioning and visual grounding, focus on these two. In this paper, we conduct an in-depth study of typical vision and language tasks from the perspectives of generation, analysis, and reasoning. First, we analyze and summarize the typical tasks and classical methods, organizing them by their different algorithmic concerns, and further discuss the frequently used datasets and evaluation metrics. Then, we briefly summarize several variant and cutting-edge tasks to build a more comprehensive framework of vision and language related multi-modal tasks. Finally, we discuss the development of pre-training related research and provide an outlook for future work. We hope this survey helps researchers understand the latest progress, open problems, and exploration directions of vision and language multi-modal tasks, and provides guidance for future research.

Keywords: deep learning, pre-training, vision and language, multi-modal generation, multi-modal analysis, multi-modal reasoning


Publication history

Received: 04 July 2022
Revised: 15 December 2022
Accepted: 27 December 2022
Published: 10 March 2023
Issue date: December 2022

Copyright

© The author(s) 2022

Acknowledgements

This work was supported in part by the National Natural Science Foundation of China (No. 61831005).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
