Abstract
This paper introduces ProFiC-VLP, a fine-grained vision-language pre-training (VLP) model featuring progressive freezing and feedback-controlled cropping. ProFiC-VLP addresses the challenge of fine-grained image-text semantic alignment in VLP, specifically aligning visual regions with their corresponding textual phrases. The model integrates three levels of image-text semantic alignment: global alignment, object alignment, and relationship alignment. It employs a vision transformer for image encoding and a semantic parser that decomposes text into multi-level semantic structures, which are then encoded into textual semantic representations by lightweight querying transformers. To prevent overfitting during pre-training, ProFiC-VLP introduces a progressive freezing strategy that aligns the multi-level textual representations with their corresponding image concepts. Furthermore, for relationship-alignment pre-training, we propose a novel language-generation approach with a feedback-controlled cropping loss that effectively localizes the image regions corresponding to specific textual phrases. Experimental results show that ProFiC-VLP outperforms existing models across a range of vision-language tasks, especially in cross-modal reasoning scenarios such as visual grounding and image captioning. Visualization experiments further underscore ProFiC-VLP’s ability to achieve fine-grained alignment of semantic relationships, significantly enhancing its capacity to address complex vision-language challenges.