Image-text retrieval aims to capture the semantic correspondence between images and texts and is a foundational component of multi-modal recommendation, search systems, and online shopping. Existing mainstream methods primarily focus on modeling the association of image-text pairs while neglecting the benefits of multi-task learning for image-text retrieval. To this end, a multi-task visual semantic embedding network (MVSEN) is proposed for image-text retrieval. Specifically, we design two auxiliary tasks, text-text matching and multi-label classification, as semantic constraints that improve the generalization and robustness of visual semantic embedding from a training perspective. In addition, we present an intra- and inter-modality interaction scheme that learns discriminative visual and textual feature representations by facilitating information flow within and between modalities. We then use multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MSCOCO, with rSum improvements of 8.2% and 3.0%, respectively.
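For readers unfamiliar with the rSum metric reported above, the following is a minimal sketch of how it is commonly computed in the image-text retrieval literature: the sum of Recall@1, Recall@5, and Recall@10 in both retrieval directions. The function names, the toy similarity matrix, and the one-caption-per-image ground truth are illustrative assumptions; the abstract does not describe the paper's exact evaluation protocol, which may differ (for example, Flickr30K and MSCOCO provide several captions per image).

```python
def recall_at_k(sim, k):
    """Percentage of queries (rows) whose ground-truth item appears in the top-k columns.

    Assumption for this sketch: query i matches candidate i (one-to-one pairs).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate indices by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


def rsum(sim, ks=(1, 5, 10)):
    """Sum of Recall@K over both directions for an image-by-text similarity matrix."""
    # Image-to-text retrieval: each row is an image query scored against all texts.
    image_to_text = [recall_at_k(sim, k) for k in ks]
    # Text-to-image retrieval: transpose so each row is a text query.
    sim_t = [list(col) for col in zip(*sim)]
    text_to_image = [recall_at_k(sim_t, k) for k in ks]
    return sum(image_to_text) + sum(text_to_image)


# Hypothetical 3x3 similarity scores (rows: images, columns: texts).
toy_sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.1],
    [0.7, 0.2, 0.6],
]
toy_rsum = rsum(toy_sim, ks=(1, 2))
```

With the default cutoffs the maximum possible rSum is 600, reached when every query ranks its matching item first in both directions.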