Research Article | Open Access

Text to image generation with bidirectional Multiway Transformers

Department of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Microsoft Research Asia, Beijing 100080, China

Abstract

In this study, we explore the potential of Multiway Transformers for text-to-image generation, aiming to combine the performance benefits of a concise, decoupled model design with the inference efficiency afforded by bidirectional encoding. We first propose a method for improving the image tokenizer using pretrained Vision Transformers. We then employ bidirectional Multiway Transformers to recover masked visual tokens conditioned on the unmasked text tokens. On the MS-COCO benchmark, our Multiway Transformers outperform vanilla Transformers, achieving superior FID scores and confirming the efficacy of the modality-specific parameter computation design. Ablation studies reveal that fusing visual and text tokens during bidirectional encoding improves model performance. In addition, our proposed tokenizer surpasses VQGAN in image reconstruction quality and further enhances text-to-image generation results. By incorporating the additional CC-3M dataset for intermediate finetuning of our 688M-parameter model, we achieve competitive results with a finetuned FID score of 4.98 on MS-COCO.
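The abstract describes the core mechanism: bidirectional self-attention shared over concatenated text and visual tokens, modality-specific feed-forward parameters, and training that recovers masked visual tokens from the unmasked text. The PyTorch sketch below is a minimal illustration of that routing and masked-infilling step, not the authors' implementation; all module names, dimensions, the 75% masking ratio, and the toy driver code are assumptions made for the example.

```python
# Minimal sketch of a bidirectional Multiway Transformer block, assuming the
# modality-routed design implied by the abstract: shared (non-causal)
# self-attention plus separate vision/language feed-forward "experts".
import torch
import torch.nn as nn


class MultiwayBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, ffn_mult: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

        def ffn() -> nn.Sequential:
            return nn.Sequential(
                nn.Linear(dim, dim * ffn_mult), nn.GELU(),
                nn.Linear(dim * ffn_mult, dim),
            )

        # Modality-specific parameters: one FFN per modality, attention shared.
        self.vision_ffn = ffn()
        self.text_ffn = ffn()

    def forward(self, x: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # x: (B, L, D); is_vision: (B, L) boolean mask marking visual tokens.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # bidirectional: no causal mask
        x = x + attn_out
        h = self.norm2(x)
        # Route each token through the FFN matching its modality.
        out = torch.where(is_vision.unsqueeze(-1), self.vision_ffn(h), self.text_ffn(h))
        return x + out


# Toy masked-infilling step: unmasked text tokens condition the recovery of
# masked visual tokens, whose codebook indices a linear head predicts.
B, T_txt, T_img, D, vocab = 2, 16, 256, 512, 8192  # illustrative sizes only
block = MultiwayBlock(dim=D)
to_logits = nn.Linear(D, vocab)               # predicts visual codebook indices
mask_embed = nn.Parameter(torch.zeros(D))     # learned [MASK] embedding

text = torch.randn(B, T_txt, D)               # embedded caption tokens (kept unmasked)
image = torch.randn(B, T_img, D)              # embedded visual tokens from the image tokenizer
masked = torch.rand(B, T_img) < 0.75          # mask a large fraction of visual tokens
image = torch.where(masked.unsqueeze(-1), mask_embed.expand_as(image), image)

x = torch.cat([text, image], dim=1)
is_vision = torch.cat(
    [torch.zeros(B, T_txt, dtype=torch.bool), torch.ones(B, T_img, dtype=torch.bool)], dim=1
)
logits = to_logits(block(x, is_vision)[:, T_txt:][masked])  # logits at masked positions only
print(logits.shape)  # (num_masked_tokens, vocab)
```

For brevity this sketch evaluates both FFN experts on every token and selects per token afterwards; a practical implementation would gather tokens by modality before applying the corresponding expert.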

Computational Visual Media
Pages 405-422
Cite this article:
Bao H, Dong L, Piao S, et al. Text to image generation with bidirectional Multiway Transformers. Computational Visual Media, 2025, 11(2): 405-422. https://doi.org/10.26599/CVM.2025.9450377


Received: 14 June 2023
Accepted: 01 September 2023
Published: 08 May 2025
© The Author(s) 2025.

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
