HDR-Net-Fusion: Real-time 3D dynamic scene reconstruction with a hierarchical deep reinforcement network

Hao-Xuan Song 1, Jiahui Huang 1, Yan-Pei Cao 2, Tai-Jiang Mu 1 (corresponding author)
1 BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
2 Kuaishou Technology Co., Ltd., Beijing 100085, China

Abstract

Reconstructing dynamic scenes with commodity depth cameras has many applications in computer graphics, computer vision, and robotics. However, due to the presence of noise and erroneous observations from data capturing devices and the inherently ill-posed nature of non-rigid registration with insufficient information, traditional approaches often produce low-quality geometry with holes, bumps, and misalignments. We propose a novel 3D dynamic reconstruction system, named HDR-Net-Fusion, which learns to simultaneously reconstruct and refine the geometry on the fly with a sparse embedded deformation graph of surfels, using a hierarchical deep reinforcement (HDR) network. The latter comprises two parts: a global HDR-Net which rapidly detects local regions with large geometric errors, and a local HDR-Net serving as a local patch refinement operator to promptly complete and enhance such regions. Training the global HDR-Net is formulated as a novel reinforcement learning problem to implicitly learn the region selection strategy with the goal of improving the overall reconstruction quality. The applicability and efficiency of our approach are demonstrated using a large-scale dynamic reconstruction dataset. Our method can reconstruct geometry with higher quality than traditional methods.
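
To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of the refinement loop described above. Everything in it is an illustrative assumption rather than the authors' implementation: the networks are placeholder MLPs, the feature dimensions are arbitrary, and the fixed top-k rule stands in for the region selection strategy that the paper actually trains with reinforcement learning.

```python
import torch
import torch.nn as nn

class GlobalHDRNet(nn.Module):
    """Placeholder for the global HDR-Net: scores each deformation-graph
    region by its estimated geometric error (architecture is hypothetical)."""
    def __init__(self, feat_dim=6):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, node_feats):             # (N, feat_dim) -> (N,)
        return self.mlp(node_feats).squeeze(-1)

class LocalHDRNet(nn.Module):
    """Placeholder for the local HDR-Net: refines one local patch by
    predicting a residual over point positions."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, 3))

    def forward(self, patch_xyz):              # (P, 3) -> (P, 3)
        return patch_xyz + self.mlp(patch_xyz)

def refine_step(node_feats, patches, global_net, local_net, k=4):
    """One refinement pass: the global net flags the k regions with the
    largest predicted error; the local net completes/enhances each one.
    The paper learns this selection policy via RL; fixed top-k is used
    here only for illustration."""
    scores = global_net(node_feats)            # per-region error estimates
    worst = torch.topk(scores, k).indices      # regions most in need of repair
    for i in worst.tolist():
        patches[i] = local_net(patches[i])     # refine the selected patch
    return patches

# Toy usage: 32 graph nodes, each carrying a 256-point local patch.
node_feats = torch.randn(32, 6)
patches = [torch.randn(256, 3) for _ in range(32)]
patches = refine_step(node_feats, patches, GlobalHDRNet(), LocalHDRNet())
```

In the system described above, the selection policy is learned with a reward tied to overall reconstruction quality, and the local network operates on surfel patches drawn from the embedded deformation graph rather than raw point residuals as sketched here.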

Keywords: deep reinforcement learning, dynamic 3D scene reconstruction, point cloud completion, deep neural networks

Electronic supplementary material

Video: CVM_2021_4_419-435_ESM.mp4

Publication history

Received: 01 March 2021
Accepted: 27 March 2021
Published: 05 August 2021
Issue date: December 2021

Copyright

© The Author(s) 2021

Acknowledgements

We thank the anonymous reviewers for their helpful comments on this paper. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61902210 and 61521002).

Rights and permissions

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
