Regular Paper

Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval

School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
School of Computer Engineering, Weifang University, Weifang 261061, China
Guangxi Colleges and Universities Key Laboratory of Intelligent Industry Software, Wuzhou University, Wuzhou 543002, China

Abstract

Image-text retrieval aims to capture the semantic correspondence between images and texts, and it is a foundational component of multi-modal recommendation, search systems, and online shopping. Existing mainstream methods primarily model the association of image-text pairs and neglect the benefits that multi-task learning can bring to image-text retrieval. To this end, a multi-task visual semantic embedding network (MVSEN) is proposed for image-text retrieval. Specifically, we design two auxiliary tasks, text-text matching and multi-label classification, as semantic constraints that improve the generalization and robustness of the visual semantic embedding during training. In addition, we present an intra- and inter-modality interaction scheme that learns discriminative visual and textual feature representations by facilitating information flow within and between modalities. We then cascade multi-layer graph convolutional networks to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MSCOCO, with rSum improvements of 8.2% and 3.0%, respectively.
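
The rSum figure quoted in the abstract is not defined on this page. In the image-text retrieval literature it conventionally denotes the sum of Recall@1, Recall@5, and Recall@10 over both retrieval directions (image-to-text and text-to-image). The sketch below illustrates that convention only; it is an assumption that the paper uses exactly this definition, and the function names `recall_at_k` and `rsum` and the sample ranks are hypothetical, not taken from the authors' code.

```python
from typing import Sequence


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose best correct match is ranked in the top k.

    `ranks` holds, per query, the 0-based rank of the highest-ranked
    ground-truth item (an assumed input format for this sketch).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r < k)
    return 100.0 * hits / len(ranks)


def rsum(
    i2t_ranks: Sequence[int],
    t2i_ranks: Sequence[int],
    ks: Sequence[int] = (1, 5, 10),
) -> float:
    """Sum of Recall@K over both retrieval directions (conventional rSum)."""
    i2t = sum(recall_at_k(i2t_ranks, k) for k in ks)
    t2i = sum(recall_at_k(t2i_ranks, k) for k in ks)
    return i2t + t2i


# Made-up ranks for four queries in each direction (illustration only).
example_i2t = [0, 2, 7, 12]   # image query -> rank of best matching caption
example_t2i = [0, 0, 4, 30]   # caption query -> rank of the matching image
example_rsum = rsum(example_i2t, example_t2i)
```

With six recall terms of at most 100 each, rSum under this convention has a maximum of 600, which is why it is often reported as a single summary score alongside the individual Recall@K values.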

Journal of Computer Science and Technology
Pages 811-826
Cite this article:
Qin X-Y, Li L-S, Tang J-Y, et al. Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval. Journal of Computer Science and Technology, 2024, 39(4): 811-826. https://doi.org/10.1007/s11390-024-4125-1


Received: 16 January 2024
Accepted: 20 June 2024
Published: 20 September 2024
© Institute of Computing Technology, Chinese Academy of Sciences 2024