Open Access | Just Accepted

A Fine-grained Vision-Language Pretraining Model with Progressive Freezing and Feedback-Controlled Cropping

Yang Qin1, Shuxue Ding1(✉), Huiming Xie2(✉), Benying Tan1, Yujie Li1

1 Guangxi Colleges and Universities Key Laboratory of AI Algorithm Engineering, School of Artificial Intelligence, Guilin University of Electronic Technology, No. 1 Jinji Road, Guilin 541004, China

2 Engineering Comprehensive Training Center, Guilin University of Aerospace Technology, No. 2 Jinji Road, Guilin 541004, China


Abstract

This paper introduces ProFiC-VLP, a fine-grained vision-language pre-training (VLP) model featuring progressive freezing and feedback-controlled cropping. ProFiC-VLP addresses the challenge of fine-grained image-text semantic relationship alignment in VLP, specifically the alignment of visual regions with their corresponding textual phrases. The model integrates three levels of image-text semantic alignment: global alignment, object alignment, and relationship alignment. It employs a vision transformer for image encoding and a semantic parser that decomposes text into multi-level semantic structures, which are then encoded into textual semantics by lightweight querying transformers. To prevent overfitting during pre-training, ProFiC-VLP introduces a progressive freezing strategy that aligns multi-level textual representations with the corresponding image concepts. Furthermore, a novel language generation approach with a feedback-controlled cropping loss function is proposed for relationship-alignment pre-training, which effectively localizes the image regions corresponding to specific textual phrases. Experimental results show that ProFiC-VLP outperforms existing models across various vision-language tasks, especially in cross-modal reasoning scenarios such as visual grounding and image caption generation. Visualization experiments further underscore ProFiC-VLP’s ability to achieve fine-grained alignment of semantic relationships, significantly enhancing its capacity to address complex vision-language challenges.
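The abstract names a progressive freezing strategy but gives no implementation detail. The following is a minimal, hypothetical PyTorch sketch of how a stage-wise freezing schedule of this kind might look; the names (TinyEncoder, progressively_freeze, layers_per_stage) and the specific choice of freezing earlier layers first are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    """Stand-in for a stacked transformer-style encoder (hypothetical)."""
    def __init__(self, num_layers=6, dim=32):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

def progressively_freeze(encoder, stage, layers_per_stage=2):
    # Freeze the first `stage * layers_per_stage` layers so that
    # representations aligned in earlier stages stay stable while
    # later stages fine-tune only the remaining layers.
    num_frozen = stage * layers_per_stage
    for i, layer in enumerate(encoder.layers):
        for p in layer.parameters():
            p.requires_grad = i >= num_frozen

if __name__ == "__main__":
    enc = TinyEncoder()
    # One stage per alignment level named in the abstract.
    for stage, name in enumerate(["global", "object", "relationship"]):
        progressively_freeze(enc, stage)
        trainable = sum(p.numel() for p in enc.parameters() if p.requires_grad)
        print(f"{name} alignment stage: {trainable} trainable parameters")

In this sketch each successive alignment stage (global, object, relationship) freezes more of the encoder than the last; the actual stage boundaries and which modules ProFiC-VLP freezes are defined in the paper itself.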

Tsinghua Science and Technology
Cite this article:
Qin Y, Ding S, Xie H, et al. A Fine-grained Vision-Language Pretraining Model with Progressive Freezing and Feedback-Controlled Cropping. Tsinghua Science and Technology, 2025, https://doi.org/10.26599/TST.2025.9010111


Received: 29 September 2024
Revised: 03 May 2025
Accepted: 20 June 2025
Available online: 01 July 2025

© The author(s) 2025

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
