Cross-Modal Retrieval from Coarse-Grained to Fine-Grained Perspectives: A Survey

Yu-Xin Peng; Ming-Hang Zheng; Yang Liu

doi:10.1007/s11390-026-5922-5

Survey

Wangxuan Institute of Computer Technology, Peking University, Beijing 100871, China

Abstract

Cross-modal retrieval (CMR) has become a fundamental technique in multimedia understanding and recommendation systems, enabling information retrieval across heterogeneous modalities such as images, videos, and text. While several prior surveys have reviewed the progress of CMR, they are limited by outdated taxonomies and insufficient coverage of recent developments. In particular, most surveys focus on coarse-grained retrieval, which retrieves entire instances given a query, while neglecting fine-grained tasks that require retrieving at a finer semantic level to distinguish different subcategories or retrieving only a specific part of the instance, such as a region within an image or a segment within a video. Moreover, due to the rapid development of large-scale vision-language pre-training (VLP) models and multimodal large language models (MLLMs), many existing surveys fail to capture the impact of these transformative advancements on CMR. To address these gaps, in this survey, we provide a unified taxonomy, categorizing CMR into coarse-grained cross-modal retrieval (CCMR) and fine-grained cross-modal retrieval (FCMR). CCMR aims to retrieve the whole instance based on the given query, such as image-text and video-text retrieval. FCMR aims to distinguish and retrieve specific subordinate-level fine-grained categories within a super-class, or retrieve a part of the instance, such as image grounding and video temporal grounding. Taking both these types into consideration brings a broad view to CMR, bridges the gap between disparate tasks, and offers a comprehensive overview of the field. We review major methodological paradigms, including recent VLP-based and MLLM-based approaches, and summarize widely used datasets and evaluation protocols. Beyond systematic performance comparisons, we also discuss applications and insights for future research.
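The distinction between CCMR and FCMR drawn above can be illustrated with toy embeddings. This is a minimal sketch and not a method from the survey: the function names, the 2-D vectors, and the use of cosine similarity over precomputed embeddings are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_retrieve(query, gallery):
    """CCMR-style (hypothetical helper): rank whole instances against the query."""
    return sorted(gallery, key=lambda name: cosine(query, gallery[name]), reverse=True)


def fine_retrieve(query, parts):
    """FCMR-style (hypothetical helper): index of the best-matching part of one instance."""
    scores = [cosine(query, part) for part in parts]
    return max(range(len(parts)), key=scores.__getitem__)


# Toy text-query embedding (assumed to share a space with the visual embeddings).
query = [1.0, 0.0]

# Coarse-grained: each gallery entry is an embedding of an entire video.
gallery = {"video_a": [0.9, 0.1], "video_b": [0.1, 0.9]}
ranking = coarse_retrieve(query, gallery)

# Fine-grained: each entry is an embedding of one segment within a single video.
segments = [[0.0, 1.0], [0.8, 0.2], [0.3, 0.7]]
best_segment = fine_retrieve(query, segments)
```

Here `ranking` orders whole videos, while `best_segment` picks one segment inside a single video, which mirrors the instance-level versus part-level split in the taxonomy. Real systems would obtain the embeddings from a learned encoder rather than hand-written vectors.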

Keywords

cross-modal retrieval; image-text retrieval; video-text retrieval; image grounding; video temporal grounding

Electronic Supplementary Material

Video: JCST-2509-15922-Video.mp4

Highlights: JCST-2509-15922-Highlights.pdf (1 MB)

Journal of Computer Science and Technology, Volume 41, Issue 1, April 2026, Pages 359-393

DOI: 10.1007/s11390-026-5922-5

Cite this article:

Peng Y-X, Zheng M-H, Liu Y. Cross-Modal Retrieval from Coarse-Grained to Fine-Grained Perspectives: A Survey. Journal of Computer Science and Technology, 2026, 41(1): 359-393. https://doi.org/10.1007/s11390-026-5922-5

Received: 08 September 2025

Accepted: 22 January 2026

Published: 30 April 2026