X. Wu, K. Xu, and P. Hall, A survey of image synthesis and editing with generative adversarial networks, Tsinghua Science and Technology, vol. 22, no. 6, pp. 660–674, 2017.
T. Qiao, J. Zhang, D. Xu, and D. Tao, MirrorGAN: Learning text-to-image generation by redescription, in Proc. 2019 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 1505–1514.
G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao, Semantics disentangling for text-to-image generation, in Proc. 2019 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 2322–2331.
M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, et al., CogView: Mastering text-to-image generation via transformers, arXiv preprint arXiv: 2105.13290, 2021.
Z. Qi, J. Sun, J. Qian, J. Xu, and S. Zhan, PCCM-GAN: Photographic text-to-image generation with pyramid contrastive consistency model, Neurocomputing, vol. 449, pp. 330–341, 2021.
Z. Zhang and L. Schomaker, DiverGAN: An efficient and effective single-stage framework for diverse text-to-image generation, Neurocomputing, vol. 473, pp. 182–198, 2022.
W. Xia, Y. Yang, J. H. Xue, and B. Wu, TediGAN: Text-guided diverse face image generation and manipulation, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Nashville, TN, USA, 2021, pp. 2256–2265.
A. Elgammal, B. Liu, M. Elhoseiny, and M. Mazzone, CAN: Creative adversarial networks, generating “art” by learning about styles and deviating from style norms, arXiv preprint arXiv: 1706.07068, 2017.
Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, and A. Elgammal, A generative adversarial approach for zero-shot learning from noisy texts, in Proc. 2018 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 1004–1013.
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, Generative adversarial nets, in Proc. 27th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2014, pp. 2672–2680.
G. Adorni, M. Di Manzo, and F. Giunchiglia, Natural language driven image generation, in Proc. 10th Int. Conf. Computational Linguistics and 22nd Annu. Meeting of the Association for Computational Linguistics, Stanford, CA, USA, 1984, pp. 495–500.
A. Yamada, T. Yamamoto, H. Ikeda, T. Nishida, and S. Doshita, Reconstructing spatial image from natural language texts, in Proc. 14th Conf. Computational Linguistics, Nantes, France, 1992, pp. 1279–1283.
S. R. Clay and J. Wilhelms, Put: Language-based interactive manipulation of objects, IEEE Comput. Graph. Appl., vol. 16, no. 2, pp. 31–39, 1996.
B. Coyne and R. Sproat, WordsEye: An automatic text-to-scene conversion system, in Proc. 28th Annu. Conf. Computer Graphics and Interactive Techniques, Los Angeles, CA, USA, 2001, pp. 487–496.
R. Johansson, A. Berglund, M. Danielsson, and P. Nugues, Automatic text-to-scene conversion in the traffic accident domain, in Proc. 19th Int. Joint Conf. Artificial Intelligence, Edinburgh, UK, 2005, pp. 1073–1078.
X. Zhu, A. B. Goldberg, M. Eldawy, C. R. Dyer, and B. Strock, A text-to-picture synthesis system for augmenting communication, in Proc. 22nd National Conf. Artificial Intelligence, Vancouver, Canada, 2007, pp. 1590–1595.
J. Agnese, J. Herrera, H. Tao, and X. Zhu, A survey and taxonomy of adversarial neural networks for text-to-image synthesis, WIREs Data Mining and Knowledge Discovery, vol. 10, no. 4, p. e1345, 2020.
D. P. Kingma and M. Welling, Auto-encoding variational bayes, arXiv preprint arXiv: 1312.6114, 2022.
E. Mansimov, E. Parisotto, J. L. Ba, and R. Salakhutdinov, Generating images from captions with attention, arXiv preprint arXiv: 1511.02793, 2016.
X. Yan, J. Yang, K. Sohn, and H. Lee, Attribute2Image: Conditional image generation from visual attributes, in Proc. 14th European Conf. Computer Vision, Amsterdam, the Netherlands, 2016, pp. 776–791.
M. Mirza and S. Osindero, Conditional generative adversarial nets, arXiv preprint arXiv: 1411.1784, 2014.
G. Antipov, M. Baccouche, and J. L. Dugelay, Face aging with conditional generative adversarial networks, in Proc. 2017 IEEE Int. Conf. Image Processing (ICIP), Beijing, China, 2017, pp. 2089–2093.
S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, Generative adversarial text to image synthesis, in Proc. 33rd Int. Conf. Machine Learning, New York, NY, USA, 2016, pp. 1060–1069.
A. Odena, C. Olah, and J. Shlens, Conditional image synthesis with auxiliary classifier GANs, in Proc. 34th Int. Conf. Machine Learning, Sydney, Australia, 2017, pp. 2642–2651.
H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, StackGAN++: Realistic image synthesis with stacked generative adversarial networks, IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1947–1962, 2019.
M. Cha, Y. L. Gwon, and H. T. Kung, Adversarial learning of semantic relevance in text to image synthesis, in Proc. 33rd AAAI Conf. Artificial Intelligence and 31st Innovative Applications of Artificial Intelligence Conf. and 9th AAAI Symp. Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 2019, pp. 3272–3279.
D. M. Souza, J. Wehrmann, and D. D. Ruiz, Efficient neural architecture for text-to-image synthesis, in Proc. 2020 Int. Joint Conf. Neural Networks (IJCNN), Glasgow, UK, 2020, pp. 1–8.
Y. Yang, L. Wang, D. Xie, C. Deng, and D. Tao, Multi-sentence auxiliary adversarial networks for fine-grained text-to-image synthesis, IEEE Trans. Image Process., vol. 30, pp. 2798–2809, 2021.
H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks, in Proc. 2017 IEEE Int. Conf. Computer Vision, Venice, Italy, 2017, pp. 5908–5916.
T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks, in Proc. 2018 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 1316–1324.
H. Zhang, J. Y. Koh, J. Baldridge, H. Lee, and Y. Yang, Cross-modal contrastive learning for text-to-image generation, in Proc. 2021 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Nashville, TN, USA, 2021, pp. 833–842.
A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models, arXiv preprint arXiv: 2112.10741, 2022.
A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, Hierarchical text-conditional image generation with CLIP latents, arXiv preprint arXiv: 2204.06125, 2022.
C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, et al., Photorealistic text-to-image diffusion models with deep language understanding, arXiv preprint arXiv: 2205.11487, 2022.
J. Liu, H. Bai, H. Zhang, and L. Liu, Near-real feature generative network for generalized zero-shot learning, in Proc. 2021 IEEE Int. Conf. Multimedia and Expo (ICME), Shenzhen, China, 2021, pp. 1–6.
S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, Learning what and where to draw, in Proc. 30th Int. Conf. Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 217–225.
Z. Zhang, Y. Xie, and L. Yang, Photographic text-to-image synthesis with a hierarchically-nested adversarial network, in Proc. 2018 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6199–6208.
M. Tao, H. Tang, S. Wu, N. Sebe, F. Wu, and X. Y. Jing, DF-GAN: Deep fusion generative adversarial networks for text-to-image synthesis, arXiv preprint arXiv: 2008.05865, 2022.
J. Cheng, F. Wu, Y. Tian, L. Wang, and D. Tao, RiFeGAN: Rich feature generation for text-to-image synthesis from prior knowledge, in Proc. 2020 IEEE/CVF Conf. Computer Vision and Pattern Recognition, Seattle, WA, USA, 2020, pp. 10908–10917.
S. Ruan, Y. Zhang, K. Zhang, Y. Fan, F. Tang, Q. Liu, and E. Chen, DAE-GAN: Dynamic aspect-aware GAN for text-to-image synthesis, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision, Montreal, Canada, 2021, pp. 13940–13949.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in Proc. 38th Int. Conf. Machine Learning, Virtual Event, 2021, pp. 8748–8763.
A. Brock, J. Donahue, and K. Simonyan, Large scale GAN training for high fidelity natural image synthesis, arXiv preprint arXiv: 1809.11096, 2019.
T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, Microsoft COCO: Common objects in context, in Proc. 13th European Conf. Computer Vision, Zurich, Switzerland, 2014, pp. 740–755.
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, Improved techniques for training GANs, in Proc. 30th Int. Conf. Neural Information Processing Systems, Barcelona, Spain, 2016, pp. 2234–2242.
M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, in Proc. 31st Int. Conf. Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 6629–6640.