Review | Open Access

A Survey of Vision and Language Related Multi-Modal Task

Lanxiao Wang1, Wenzhe Hu1, Heqian Qiu1, Chao Shang1, Taijin Zhao1, Benliu Qiu1, King Ngi Ngan2, Hongliang Li1 (corresponding author)
1 Department of Information and Communication Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 The Chinese University of Hong Kong, Hong Kong 999077, China

Abstract

With the significant breakthroughs in single-modal deep learning research, more and more works have begun to focus on multi-modal tasks. A multi-modal task usually involves more than one modality, where a modality represents a type of behavior or state. Common modalities include vision, hearing, language, touch, and smell. Vision and language are two of the most common modalities in human daily life, and many typical multi-modal tasks focus on these two, such as visual captioning and visual grounding. In this paper, we conduct an in-depth study of typical vision and language tasks from the perspectives of generation, analysis, and reasoning. First, we analyze and summarize the typical tasks and some classical methods, organized according to their different algorithmic concerns, and further discuss frequently used datasets and evaluation metrics. Then, we briefly summarize some variant and cutting-edge tasks to build a more comprehensive framework of vision and language related multi-modal tasks. Finally, we discuss the development of pre-training related research and offer an outlook on future research. We hope this survey can help researchers understand the latest progress, open problems, and exploration directions of vision and language multi-modal tasks, and provide guidance for future research.

CAAI Artificial Intelligence Research, Pages 111-136

Cite this article:
Wang L, Hu W, Qiu H, et al. A Survey of Vision and Language Related Multi-Modal Task. CAAI Artificial Intelligence Research, 2022, 1(2): 111-136. https://doi.org/10.26599/AIR.2022.9150008

7254 Views · 943 Downloads · 2 Crossref citations

Received: 04 July 2022
Revised: 15 December 2022
Accepted: 27 December 2022
Published: 10 March 2023
© The author(s) 2022

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).