
Deep image synthesis from intuitive user input: A review and perspectives

Yuan Xue 1, Yuan-Chen Guo 2, Han Zhang 3, Tao Xu 4, Song-Hai Zhang 2, Xiaolei Huang 1 (corresponding author)
1 College of Information Sciences and Technology, The Pennsylvania State University, University Park, PA, USA
2 Department of Computer Science and Technology, Tsinghua University, Beijing, China, and Beijing National Research Center for Information Science and Technology (BNRist), Tsinghua University, Beijing, China
3 Google Brain, Mountain View, CA, USA
4 Facebook, Menlo Park, CA, USA

Abstract

In many applications of computer graphics, art, and design, it is desirable for a user to provide intuitive non-image input, such as text, a sketch, strokes, a graph, or a layout, and have a computer system automatically generate photo-realistic images matching that input. Classically, works enabling such automatic image content generation followed a framework of image retrieval and composition; recent advances in deep generative models, such as generative adversarial networks (GANs), variational autoencoders (VAEs), and flow-based methods, have enabled more powerful and versatile image generation approaches. This paper reviews recent works on image synthesis from intuitive user input, covering advances in input versatility, image generation methodology, benchmark datasets, and evaluation metrics. The review motivates new perspectives on input representation and interactivity, cross-fertilization between major image generation paradigms, and evaluation and comparison of generation methods.
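As a deliberately minimal illustration of the conditional generation idea underlying the methods the review surveys (and not code from any cited work): the sketch below conditions a GAN-style generator on a class label standing in for the user's intuitive input. PyTorch is assumed, and every layer size and name is an illustrative choice.

import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    """Maps a noise vector plus an embedded condition to an RGB image."""
    def __init__(self, n_classes=10, z_dim=100, emb_dim=32, img_side=64):
        super().__init__()
        self.img_side = img_side
        self.label_emb = nn.Embedding(n_classes, emb_dim)  # embed the user input
        self.net = nn.Sequential(
            nn.Linear(z_dim + emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 3 * img_side * img_side),
            nn.Tanh(),  # pixel values in [-1, 1]
        )

    def forward(self, z, labels):
        # Concatenate noise with the embedded condition, then decode to pixels.
        cond = self.label_emb(labels)
        x = self.net(torch.cat([z, cond], dim=1))
        return x.view(-1, 3, self.img_side, self.img_side)

# Usage: generate a batch of 4 images, all conditioned on class label 7.
g = ConditionalGenerator()
z = torch.randn(4, 100)
labels = torch.full((4,), 7, dtype=torch.long)
print(g(z, labels).shape)  # torch.Size([4, 3, 64, 64])

Richer inputs such as text, sketches, or layouts replace the label embedding with a learned encoder, but the conditioning pattern is the same.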

Keywords: image synthesis, intuitive user input, deep generative models, synthesized image quality evaluation
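The keyword "synthesized image quality evaluation" commonly denotes distributional metrics such as the Fréchet Inception Distance (FID), which compares real and generated images via feature statistics: FID = ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^{1/2}). A minimal sketch follows, assuming features have already been extracted (typically with Inception-v3, omitted here); array shapes and names are illustrative.

import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two feature sets of shape (N, D)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)  # matrix square root
    covmean = covmean.real                 # discard tiny imaginary parts
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2.0 * covmean))

# Usage with random placeholder features (real evaluations use Inception features):
real = np.random.randn(500, 64)
fake = np.random.randn(500, 64) + 0.5
print(fid(real, fake))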


Publication history

Received: 25 January 2021
Accepted: 27 March 2021
Published: 27 October 2021
Issue date: March 2022

Copyright

© The Author(s) 2021.

Acknowledgements

The co-authors Y.-C. Guo and S.-H. Zhang were supported by the National Natural Science Foundation of China (Project Nos. 61521002 and 61772298), a Research Grant of Beijing Higher Institution Engineering Research Center, and the Tsinghua–Tencent Joint Laboratory for Internet Innovation Technology.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
