Open Access

Feature-Grounded Single-Stage Text-to-Image Generation

School of Artificial Intelligence, Nanjing University of Information Science and Technology, Nanjing 210044, China
School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

Abstract

Generative Adversarial Networks (GANs) have recently become the mainstream framework for text-to-image (T2I) generation. However, input noise drawn from a standard normal distribution cannot provide sufficient information to synthesize an image that approaches the ground-truth image distribution, and the common multistage generation strategy makes T2I applications complex. Therefore, this study proposes a novel feature-grounded single-stage T2I model, which takes the "real" distribution learned from training images as one input and introduces a worst-case-optimized similarity measure into the loss function to enhance the model's generation capacity. Experimental results on two benchmark datasets demonstrate the competitive performance of the proposed model in terms of the Fréchet inception distance and inception score compared with classical and state-of-the-art models, showing improved similarity among the generated image, the text, and the ground truth.
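
The abstract outlines two ingredients: grounding the generator on a feature distribution learned from real training images (rather than on pure standard-normal noise alone) and adding a worst-case-optimized image-text similarity term to the loss. The following PyTorch-style sketch illustrates these two ideas only; it is not the authors' released code, and every module, dimension, and name used here (SingleStageGenerator, worst_case_similarity_loss, the 256-dimensional embeddings, the 64x64 output) is an illustrative assumption.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleStageGenerator(nn.Module):
    """Single-stage generator conditioned on a text embedding plus a
    feature vector drawn from a distribution fitted to real images."""
    def __init__(self, text_dim=256, feat_dim=256, ngf=64):
        super().__init__()
        # Fuse the sentence embedding with the grounded image feature
        # instead of relying on pure N(0, I) noise alone.
        self.fc = nn.Linear(text_dim + feat_dim, ngf * 8 * 4 * 4)
        self.blocks = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(ngf * 8, ngf * 4, 3, 1, 1), nn.ReLU(True),
            nn.Upsample(scale_factor=2), nn.Conv2d(ngf * 4, ngf * 2, 3, 1, 1), nn.ReLU(True),
            nn.Upsample(scale_factor=2), nn.Conv2d(ngf * 2, ngf, 3, 1, 1), nn.ReLU(True),
            nn.Upsample(scale_factor=2), nn.Conv2d(ngf, 3, 3, 1, 1), nn.Tanh(),
        )

    def forward(self, text_emb, real_feat):
        h = self.fc(torch.cat([text_emb, real_feat], dim=1))
        h = h.view(h.size(0), -1, 4, 4)            # (B, ngf*8, 4, 4)
        return self.blocks(h)                      # (B, 3, 64, 64) in [-1, 1]

def worst_case_similarity_loss(img_feat, text_emb):
    """Penalize the least similar image-text pair in the batch, so the
    optimizer improves the worst case rather than only the batch average.
    Both inputs are assumed projected into a common embedding space."""
    sim = F.cosine_similarity(img_feat, text_emb, dim=1)   # per-sample similarity
    return -sim.min()

# Usage sketch (shapes only; img_encoder, adv_loss, lambda_sim are hypothetical):
# g = SingleStageGenerator()
# fake = g(text_emb, real_feat)        # text_emb, real_feat: (B, 256)
# loss_g = adv_loss + lambda_sim * worst_case_similarity_loss(img_encoder(fake), text_emb)

In training, such a term would typically be added to the usual conditional adversarial loss, and the grounded feature vector could, for instance, be sampled from a distribution fitted to image features extracted from the training set, which is one plausible reading of the "real" distribution described above.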

Tsinghua Science and Technology
Pages 469-480
Cite this article:
Zhou Y, Wang P, Xiang L, et al. Feature-Grounded Single-Stage Text-to-Image Generation. Tsinghua Science and Technology, 2024, 29(2): 469-480. https://doi.org/10.26599/TST.2023.9010023


Received: 16 November 2022
Revised: 22 March 2023
Accepted: 26 March 2023
Published: 22 September 2023
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
