Open Access

Feature-Grounded Single-Stage Text-to-Image Generation

School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing 210044, China
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

Abstract

Generative Adversarial Networks (GANs) have recently become the mainstream framework for text-to-image (T2I) generation. However, input noise drawn from a standard normal distribution cannot provide sufficient information to synthesize an image that approaches the ground-truth image distribution, and the common multistage generation strategy makes T2I applications complex. Therefore, this study proposes a novel feature-grounded single-stage T2I model, which takes the "real" distribution learned from training images as one input and introduces a worst-case-optimized similarity measure into the loss function to enhance the model's generation capacity. Experimental results on two benchmark datasets demonstrate the competitive performance of the proposed model in terms of the Fréchet inception distance and inception score compared with classical and state-of-the-art models, showing improved similarity among the generated image, the text, and the ground truth.
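
The abstract outlines two ingredients: grounding the generator on a feature distribution learned from real training images (rather than on pure standard-normal noise alone) and adding a worst-case-optimized image-text similarity term to the loss. The following PyTorch-style sketch illustrates these two ideas only; it is not the authors' released code, and every module, dimension, and name used here (SingleStageGenerator, worst_case_similarity_loss, the 256-dimensional embeddings, the 64x64 output) is an illustrative assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleStageGenerator(nn.Module):
    """Single-stage generator conditioned on a text embedding plus a
    feature vector drawn from a distribution fitted to real images."""
    def __init__(self, text_dim=256, feat_dim=256, ngf=64):
        super().__init__()
        # Fuse the sentence embedding with the grounded image feature
        # instead of relying on pure N(0, I) noise alone.
        self.fc = nn.Linear(text_dim + feat_dim, ngf * 8 * 4 * 4)
        self.blocks = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(ngf * 8, ngf * 4, 3, 1, 1), nn.ReLU(True),
            nn.Upsample(scale_factor=2), nn.Conv2d(ngf * 4, ngf * 2, 3, 1, 1), nn.ReLU(True),
            nn.Upsample(scale_factor=2), nn.Conv2d(ngf * 2, ngf, 3, 1, 1), nn.ReLU(True),
            nn.Upsample(scale_factor=2), nn.Conv2d(ngf, 3, 3, 1, 1), nn.Tanh(),
        )

    def forward(self, text_emb, real_feat):
        h = self.fc(torch.cat([text_emb, real_feat], dim=1))
        h = h.view(h.size(0), -1, 4, 4)            # (B, ngf*8, 4, 4)
        return self.blocks(h)                      # (B, 3, 64, 64) in [-1, 1]

def worst_case_similarity_loss(img_feat, text_emb):
    """Penalize the least similar image-text pair in the batch, so the
    optimizer improves the worst case rather than only the batch average.
    Both inputs are assumed projected into a common embedding space."""
    sim = F.cosine_similarity(img_feat, text_emb, dim=1)   # per-sample similarity
    return -sim.min()

# Usage sketch (shapes only; img_encoder, adv_loss, lambda_sim are hypothetical):
# g = SingleStageGenerator()
# fake = g(text_emb, real_feat)        # text_emb, real_feat: (B, 256)
# loss_g = adv_loss + lambda_sim * worst_case_similarity_loss(img_encoder(fake), text_emb)

In training, such a term would typically be added to the usual conditional adversarial loss, and the grounded feature vector could, for instance, be sampled from a distribution fitted to image features extracted from the training set, which is one plausible reading of the "real" distribution described above.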

Tsinghua Science and Technology
Pages 469-480
Cite this article:
Zhou Y, Wang P, Xiang L, et al. Feature-Grounded Single-Stage Text-to-Image Generation. Tsinghua Science and Technology, 2024, 29(2): 469-480. https://doi.org/10.26599/TST.2023.9010023


Received: 16 November 2022
Revised: 22 March 2023
Accepted: 26 March 2023
Published: 22 September 2023
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
