Abstract
This paper introduces ProFiC-VLP, a fine-grained vision-language pre-training (VLP) model featuring progressive freezing and feedback-controlled cropping. ProFiC-VLP addresses the challenge of fine-grained image-text semantic alignment in VLP, specifically aligning visual regions with their corresponding textual phrases. The model integrates three levels of image-text semantic alignment: global alignment, object alignment, and relationship alignment. It employs a vision transformer for image encoding and a semantic parser that decomposes text into multi-level semantic structures, which are then encoded into textual semantic representations by lightweight querying transformers. To prevent overfitting during pre-training, ProFiC-VLP introduces a progressive freezing strategy that aligns the multi-level textual representations with their corresponding image concepts. Furthermore, for relationship-alignment pre-training, we propose a novel language-generation approach with a feedback-controlled cropping loss that effectively localizes the image regions corresponding to specific textual phrases. Experimental results show that ProFiC-VLP outperforms existing models across a range of vision-language tasks, especially in cross-modal reasoning scenarios such as visual grounding and image captioning. Visualization experiments further underscore ProFiC-VLP’s ability to achieve fine-grained alignment of semantic relationships, significantly enhancing its capacity to address complex vision-language challenges.