Survey

Cross-Modal Retrieval from Coarse-Grained to Fine-Grained Perspectives: A Survey

Wangxuan Institute of Computer Technology, Peking University, Beijing 100871, China

Abstract

Cross-modal retrieval (CMR) has become a fundamental technique in multimedia understanding and recommendation systems, enabling information retrieval across heterogeneous modalities such as images, videos, and text. While several prior surveys have reviewed the progress of CMR, they are limited by outdated taxonomies and insufficient coverage of recent developments. In particular, most surveys focus on coarse-grained retrieval, which retrieves entire instances given a query, while neglecting fine-grained tasks that require retrieving at a finer semantic level to distinguish different subcategories or retrieving only a specific part of the instance, such as a region within an image or a segment within a video. Moreover, due to the rapid development of large-scale vision-language pre-training (VLP) models and multimodal large language models (MLLMs), many existing surveys fail to capture the impact of these transformative advancements on CMR. To address these gaps, in this survey, we provide a unified taxonomy, categorizing CMR into coarse-grained cross-modal retrieval (CCMR) and fine-grained cross-modal retrieval (FCMR). CCMR aims to retrieve the whole instance based on the given query, such as image-text and video-text retrieval. FCMR aims to distinguish and retrieve specific subordinate-level fine-grained categories within a super-class, or retrieve a part of the instance, such as image grounding and video temporal grounding. Taking both these types into consideration brings a broad view to CMR, bridges the gap between disparate tasks, and offers a comprehensive overview of the field. We review major methodological paradigms, including recent VLP-based and MLLM-based approaches, and summarize widely used datasets and evaluation protocols. Beyond systematic performance comparisons, we also discuss applications and insights for future research.

Electronic Supplementary Material

Video
JCST-2509-15922-Video.mp4
JCST-2509-15922-Highlights.pdf (1 MB)

Journal of Computer Science and Technology
Pages 359-393

Cite this article:
Peng Y-X, Zheng M-H, Liu Y. Cross-Modal Retrieval from Coarse-Grained to Fine-Grained Perspectives: A Survey. Journal of Computer Science and Technology, 2026, 41(1): 359-393. https://doi.org/10.1007/s11390-026-5922-5

Received: 08 September 2025
Accepted: 22 January 2026
Published: 30 April 2026
© Institute of Computing Technology, Chinese Academy of Sciences 2026