Research Article | Open Access

LucIE: Language-guided local image editing for fashion images

School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Computer Vision Research Group in the Institute of Informatics, University of Amsterdam, Amsterdam, the Netherlands

Abstract

Language-guided fashion image editing is challenging: fashion edits are local and require high precision, while natural language cannot provide precise visual guidance. In this paper, we propose LucIE, a novel unsupervised language-guided local image editing method for fashion images. LucIE adopts and modifies a recent text-to-image synthesis network, DF-GAN, as its backbone. However, the synthesis backbone often changes the global structure of the input image, making local image editing impractical. To increase structural consistency between the input and edited images, we propose the Content-Preserving Fusion Module (CPFM). Unlike existing fusion modules, CPFM avoids iterative refinement of visual feature maps and instead accumulates additive modifications on RGB maps. LucIE performs local image editing explicitly via language-guided image segmentation and mask-guided image blending, while using only image-text pairs. Results on the DeepFashion dataset show that LucIE achieves state-of-the-art performance; compared with previous methods, images generated by LucIE also exhibit fewer artifacts. We provide visualizations and perform ablation studies to validate LucIE and the CPFM, and we demonstrate and analyze the limitations of LucIE to give a better understanding of the method.
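To make the two mechanisms named in the abstract concrete, the following is a minimal PyTorch sketch of additive RGB accumulation and mask-guided blending. It is written entirely under our own assumptions: the function names, tensor shapes, and the binary mask are illustrative stand-ins, not the authors' implementation.

```python
import torch

# Hypothetical CPFM-style update: accumulate an additive modification on
# the RGB map instead of iteratively refining intermediate feature maps.
# `delta_rgb` stands in for the generator's predicted edit (assumed name).
def cpfm_accumulate(rgb: torch.Tensor, delta_rgb: torch.Tensor,
                    mask: torch.Tensor) -> torch.Tensor:
    return rgb + mask * delta_rgb

# Mask-guided blending: keep generated content inside the language-guided
# segmentation mask and the original pixels outside it, so edits stay local.
def mask_guided_blend(edited: torch.Tensor, original: torch.Tensor,
                      mask: torch.Tensor) -> torch.Tensor:
    return mask * edited + (1.0 - mask) * original

# Toy usage with random data; all shapes here are assumptions.
original = torch.rand(3, 256, 256)                  # input fashion image in [0, 1]
delta    = 0.1 * torch.randn(3, 256, 256)           # additive RGB modification
mask     = (torch.rand(1, 256, 256) > 0.5).float()  # binary language-guided mask

edited = cpfm_accumulate(original, delta, mask).clamp(0.0, 1.0)
output = mask_guided_blend(edited, original, mask)  # final locally edited image
```

The point of both operations is that pixels outside the mask pass through unchanged, which is what keeps the edit local and the global structure of the input intact.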

References

[1]
Park, T.; Liu, M. Y.; Wang, T. C.; Zhu, J. Y. Semantic image synthesis with spatially-adaptive normalization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2332–2341, 2019.
[2]
Portenier, T.; Hu, Q.; Szabó, A.; Bigdeli, S.; Favaro, P.; Zwicker, M. Faceshop: Deep sketch-based face image editing. ACM Transactions on Graphics Vol. 37, No. 4, Article No. 99, 2018.
[3]
Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J. Y.; Ermon, S. SDEdit: Guided image synthesis and editing with stochastic differential equations. In: Proceedings of the International Conference on Learning Representations, 2022.
[4]
Dekel, T.; Gan, C.; Krishnan, D.; Liu, C.; Freeman, W. T. Sparse, smart contours to represent and edit images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3511–3520, 2018.
[5]
Dong, H.; Yu, S.; Wu, C.; Guo, Y. Semantic image synthesis via adversarial learning. In: Proceedings of the IEEE International Conference on Computer Vision, 5707–5715, 2017.
[6]
Nam, S.; Kim, Y.; Kim, S. J. Text-adaptive generative adversarial networks: Manipulating images with natural language. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 42–51, 2018.
[7]
Li, B.; Qi, X.; Lukasiewicz, T.; Torr, P. H. S. ManiGAN: Text-guided image manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7877–7886, 2020.
[8]
Xia, W.; Yang, Y.; Xue, J. H.; Wu, B. TediGAN: Text-guided diverse face image generation and manipulation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2256–2265, 2021.
[9]
Yang, H.; Zhang, R.; Guo, X.; Liu, W.; Zuo, W.; Luo, P. Towards photo-realistic virtual try-on by adaptively generating-preserving image content. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7847–7856, 2020.
[10]
Han, X.; Wu, Z.; Wu, Z.; Yu, R.; Davis, L. S. VITON: An image-based virtual try-on network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7543–7552, 2018.
[11]
Han, Y.; Yang, S.; Wang, W.; Liu, J. From design draft to real attire: Unaligned fashion image translation. In: Proceedings of the 28th ACM International Conference on Multimedia, 1533–1541, 2020.
[12]
Cheng, Y.; Gan, Z.; Li, Y.; Liu, J.; Gao, J. Sequential attention GAN for interactive image editing. In: Proceedings of the 28th ACM International Conference on Multimedia, 4383–4391, 2020.
[13]
Jetchev, N.; Bergmann, U. The conditional analogy GAN: Swapping fashion articles on people images. In: Proceedings of the IEEE International Conference on Computer Vision Workshops, 2287–2292, 2017.
[14]
Yu, R.; Wang, X.; Xie, X. VTNFP: An image-based virtual try-on network with body and clothing feature preservation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 10510–10519, 2019.
[15]
Mao, X.; Chen, Y.; Li, Y.; Xiong, T.; He, Y.; Xue, H. Bilinear representation for language-based image editing using conditional generative adversarial networks. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2047–2051, 2019.
[16]
Zhu, S.; Fidler, S.; Urtasun, R.; Lin, D.; Loy, C. C. Be your own Prada: Fashion synthesis with structural coherence. In: Proceedings of the IEEE International Conference on Computer Vision, 1689–1697, 2017.
[17]
Günel, M.; Erdem, E.; Erdem, A. Language guided fashion image manipulation with feature-wise transformations. arXiv preprint arXiv: 1808.04000, 2018.
[18]
Tao, M.; Tang, H.; Wu, F.; Jing, X.; Bao, B. K.; Xu, C. DF-GAN: A simple and effective baseline for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16494–16504, 2022.
[19]
Perez, E.; Strub, F.; De Vries, H.; Dumoulin, V.; Courville, A. FiLM: Visual reasoning with a general conditioning layer. Proceedings of the AAAI Conference on Artificial Intelligence Vol. 32, No. 1, 3942–3951, 2018.
[20]
Zhang, Y.; Li, L.; Song, L.; Xie, R.; Zhang, W. FACT: Fused attention for clothing transfer with generative adversarial networks. Proceedings of the AAAI Conference on Artificial Intelligence Vol. 34, No. 7, 12894–12901, 2020.
[21]
Liu, Y.; De Nadai, M.; Cai, D.; Li, H.; Alameda-Pineda, X.; Sebe, N.; Lepri, B. Describe what to change: A text-guided unsupervised image-to-image translation approach. In: Proceedings of the 28th ACM International Conference on Multimedia, 1357–1365, 2020.
[22]
Li, Y.; Min, M. R.; Shen, D.; Carlson, D.; Carin, L. Video generation from text. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence and 30th Innovative Applications of Artificial Intelligence Conference and 8th AAAI Symposium on Educational Advances in Artificial Intelligence, Article No. 865, 7065–7072, 2018.
[23]
Chen, J.; Shen, Y.; Gao, J.; Liu, J.; Liu, X. Language-based image editing with recurrent attentive models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8721–8729, 2018.
[24]
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, 6629–6640, 2017.
[25]
Bińkowski, M.; Sutherland, D. J.; Arbel, M.; Gretton, A. Demystifying MMD GANs. In: Proceedings of the International Conference on Learning Representations, 2018.
[26]
Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training GANs. In: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2234–2242, 2016.
[27]
Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. In: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2672–2680, 2014.
[28]
Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv preprint arXiv: 1411.1784, 2014.
[29]
Ge, S.; Jin, X.; Ye, Q.; Luo, Z.; Li, Q. Image editing by object-aware optimal boundary searching and mixed-domain composition. Computational Visual Media Vol. 4, No. 1, 71–82, 2018.
[30]
Sun, R.; Huang, C.; Zhu, H.; Ma, L. Mask-aware photorealistic facial attribute manipulation. Computational Visual Media Vol. 7, No. 3, 363–374, 2021.
[31]
Zheng, Z. H.; Zhang, H. T.; Zhang, F. L.; Mu, T. J. Image-based clothes changing system. Computational Visual Media Vol. 3, No. 4, 337–347, 2017.
[32]
Xue, Y.; Guo, Y. C.; Zhang, H.; Xu, T.; Zhang, S. H.; Huang, X. Deep image synthesis from intuitive user input: A review and perspectives. Computational Visual Media Vol. 8, No. 1, 3–31, 2022.
[33]
Mao, F.; Ma, B.; Chang, H.; Shan, S.; Chen, X. Learning efficient text-to-image synthesis via interstage cross-sample similarity distillation. Science China Information Sciences Vol. 64, No. 2, Article No. 120102, 2020.
[34]
Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. In: Proceedings of the 33rd International Conference on Machine Learning, 1060–1069, 2016.
[35]
Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, 5908–5916, 2017.
[36]
Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1316–1324, 2018.
[37]
Zhang, H.; Koh, J. Y.; Baldridge, J.; Lee, H.; Yang, Y. Cross-modal contrastive learning for text-to-image generation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 833–842, 2021.
[38]
Yu, X.; Chen, Y.; Li, T.; Liu, S.; Li, G. Multi-mapping image-to-image translation via learning disentanglement. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems, Article No. 269, 2994–3004, 2019.
[39]
Liu, X.; Lin, Z.; Zhang, J.; Zhao, H.; Tran, Q.; Wang, X.; Li, H. Open-edit: Open-domain image manipulation with open-vocabulary instructions. In: Computer Vision – ECCV 2020. Lecture Notes in Computer Science, Vol. 12356. Vedaldi, A.; Bischof, H.; Brox, T.; Frahm, J. M. Eds. Springer Cham, 89–106, 2020.
[40]
Kim, J. H.; On, K. W.; Lim, W.; Kim, J.; Ha, J. W.; Zhang, B. T. Hadamard product for low-rank bilinear pooling. In: Proceedings of the International Conference on Learning Representations, 2017.
[41]
Ak, K. E.; Sun, Y.; Lim, J. H. Learning cross-modal representations for language-based image manipulation. In: Proceedings of the IEEE International Conference on Image Processing, 1601–1605, 2020.
[42]
Vo, N.; Jiang, L.; Sun, C.; Murphy, K.; Li, L. J.; Li, F. F.; Hays, J. Composing text and image for image retrieval: An empirical odyssey. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6432–6441, 2019.
[43]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A. N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, 5998–6008, 2017.
[44]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778, 2016.
[45]
Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning, 8748–8763, 2021.
[46]
Maas, A. L.; Hannun, A. Y.; Ng, A. Y. Rectifier nonlinearities improve neural network acoustic models. In: Proceedings of the 30th International Conference on Machine Learning, 2013.
[47]
Zhao, S.; Liu, Z.; Lin, J.; Zhu, J. Y.; Han, S. Differentiable augmentation for data-efficient GAN training. In: Proceedings of the 34th International Conference on Neural Information Processing Systems, Article No. 634, 7559–7570, 2020.
[48]
Kingma, D. P.; Ba, J. Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations, 2015.
Cite this article:
Wen H, You S, Fu Y. LucIE: Language-guided local image editing for fashion images. Computational Visual Media, 2025, 11(1): 179-194. https://doi.org/10.26599/CVM.2025.9450310


Received: 28 June 2022
Accepted: 04 September 2022
Published: 28 February 2025
© The Author(s) 2025.

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
