
STATE: Learning structure and texture representations for novel view synthesis

Xinyi Jing1,*, Qiao Feng1,*, Yu-Kun Lai2, Jinsong Zhang1, Yuanqiang Yu1, Kun Li1 (corresponding author)
1 College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
2 School of Computer Science and Informatics, Cardiff University, Cardiff CF24 4AG, UK

* Xinyi Jing and Qiao Feng contributed equally to this work.

Abstract

Novel view synthesis is highly challenging, especially from sparse views, due to large viewpoint changes and occlusion. Existing image-based methods fail to generate reasonable results for invisible regions, while geometry-based methods have difficulty synthesizing detailed textures. In this paper, we propose STATE, an end-to-end deep neural network for sparse view synthesis that learns structure and texture representations. Structure is encoded as a hybrid feature field to predict reasonable structures for invisible regions while maintaining original structures for visible regions, and texture is encoded as a deformed feature map to preserve detailed textures. We propose a hierarchical fusion scheme with intra-branch and inter-branch aggregation, in which spatio-view attention enables multi-view fusion at the feature level, adaptively selecting important information by regressing pixel-wise or voxel-wise confidence maps. By decoding the aggregated features, STATE generates realistic images with reasonable structures and detailed textures. Experimental results demonstrate that our method outperforms state-of-the-art methods both qualitatively and quantitatively. Benefiting from its implicit disentanglement of structure and texture, our method also enables texture and structure editing applications. Our code is available at http://cic.tju.edu.cn/faculty/likun/projects/STATE.

Keywords: sparse views, novel view synthesis, spatio-view attention, structure representation, texture representation
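To make the fusion step concrete, the short PyTorch-style sketch below illustrates one way confidence-weighted multi-view feature fusion of the kind described in the abstract can be implemented (module and tensor names are hypothetical; this is an illustrative sketch, not the authors' implementation): each view's feature map is scored pixel-wise by a small convolutional head, the scores are normalized across views with a softmax, and the fused feature is the confidence-weighted sum.

import torch
import torch.nn as nn

class SpatioViewFusion(nn.Module):
    """Illustrative confidence-weighted multi-view feature fusion (hypothetical sketch)."""

    def __init__(self, channels: int):
        super().__init__()
        # Small head that regresses a single confidence value per pixel.
        self.conf_head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_views, C, H, W) features from the per-view branches.
        conf = self.conf_head(feats)             # (num_views, 1, H, W) raw scores
        weights = torch.softmax(conf, dim=0)     # normalize confidences across views
        fused = (weights * feats).sum(dim=0)     # (C, H, W) aggregated feature
        return fused

# Usage: fuse features extracted from two sparse input views.
fusion = SpatioViewFusion(channels=64)
view_feats = torch.randn(2, 64, 32, 32)
out = fusion(view_feats)                         # shape: (64, 32, 32)

The same weighting idea extends to voxel-wise confidence maps by replacing the 2D convolution with a 3D one over volumetric features.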

Electronic supplementary material

Video: 41095_0301_ESM(2).mp4
File: 41095_0301_ESM(1).pdf (2.5 MB)

Publication history

Received: 15 February 2022
Accepted: 16 June 2022
Published: 11 July 2023
Issue date: December 2023

Copyright

© The Author(s) 2023.

Acknowledgements

We are grateful to the Associate Editor and anonymous reviewers for their help in improving this paper.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
