In this study, we explore the potential of Multiway Transformers for text-to-image generation, seeking performance gains from a concise, efficient, decoupled model design together with the inference efficiency afforded by bidirectional encoding. We first propose a method for improving the image tokenizer using pretrained Vision Transformers. We then employ bidirectional Multiway Transformers to reconstruct the masked visual tokens given the unmasked text tokens. On the MS-COCO benchmark, our Multiway Transformers outperform vanilla Transformers, achieving superior FID scores and confirming the efficacy of the modality-specific parameter design. Ablation studies show that fusing visual and text tokens during bidirectional encoding improves model performance. Our proposed tokenizer also outperforms VQGAN in image reconstruction quality and improves the text-to-image generation results. By incorporating the additional CC-3M dataset for intermediate finetuning of our 688M-parameter model, we achieve competitive results, with a finetuned FID of 4.98 on MS-COCO.
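This excerpt contains no implementation, so the following is a minimal PyTorch sketch, under stated assumptions, of the modality-specific design described above: text and visual tokens share bidirectional self-attention, then each modality is routed to its own feed-forward expert. The class name, dimensions, and two-expert routing are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn

class MultiwayBlock(nn.Module):
    """Illustrative sketch (not the authors' code) of one bidirectional
    Transformer block with modality-specific feed-forward experts:
    shared self-attention mixes text and visual tokens, after which each
    modality passes through its own FFN."""

    def __init__(self, dim: int = 512, num_heads: int = 8, ffn_dim: int = 2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # No causal mask: encoding is bidirectional over the full sequence.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One FFN expert per modality: index 0 = text, index 1 = vision (assumed).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, ffn_dim), nn.GELU(), nn.Linear(ffn_dim, dim))
            for _ in range(2)
        )

    def forward(self, x: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim); modality: (seq_len,), 0 for text, 1 for image tokens.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for m, expert in enumerate(self.experts):
            mask = modality == m
            out[:, mask] = expert(h[:, mask])  # modality-specific computation
        return x + out

# Example: 16 unmasked text tokens followed by 64 visual tokens.
block = MultiwayBlock()
tokens = torch.randn(2, 80, 512)
modality = torch.cat([torch.zeros(16), torch.ones(64)]).long()
out = block(tokens, modality)  # shape: (2, 80, 512)

Keeping the attention shared while splitting only the feed-forward networks lets a single block serve both modalities without duplicating the attention parameters, which is one way to realize the decoupled, modality-specific computation the abstract refers to.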