Automatic object annotation in streamed and remotely explored large 3D reconstructions

Benjamin Höller, Annette Mossel, Hannes Kaufmann
Institute of Visual Computing and Human-Centered Technology, Vienna University of Technology, Favoritenstraße 9-11/193/06, A-1040 Vienna, Austria

Abstract

We introduce a novel framework for 3D scene reconstruction with simultaneous object annotation, using a pre-trained 2D convolutional neural network (CNN), incremental data streaming, and remote exploration in a virtual reality setup. It enables versatile integration of any 2D box detection or segmentation network. We integrate new approaches to (i) asynchronously perform dense 3D reconstruction and object annotation at interactive frame rates, (ii) efficiently optimize CNN results in terms of object prediction and spatial accuracy, and (iii) generate computationally efficient colliders in large triangulated 3D reconstructions at run-time for 3D scene interaction. Our method is novel in combining CNNs that have long and varying inference times with live 3D reconstruction from RGB-D camera input. We further propose a lightweight data structure to store the 3D reconstruction data and object annotations, enabling fast incremental data transmission for real-time exploration with a remote client, which has not been presented before. Our framework achieves update rates of 22 fps (SSD MobileNet) and 19 fps (Mask R-CNN) for indoor environments of up to 800 m³. We evaluated the accuracy of 3D object detection. Our work provides a versatile foundation for semantic scene understanding of large streamed 3D reconstructions, while being independent of the CNN's processing time. Source code is available for non-commercial use.
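The key architectural point in the abstract is that CNN inference, whose latency is long and variable, must not block the reconstruction loop. The following is a minimal producer/consumer sketch of that decoupling, not the authors' implementation: all function names are illustrative stand-ins, and the sketch assumes the camera pose is stored with each frame so that late-arriving detections can still be projected into the volume.

```python
import queue
import threading

# Hypothetical stand-ins for the real components (names are ours):
def fuse_frame(rgb, depth, pose):
    """Integrate one RGB-D frame into the dense reconstruction."""

def run_cnn(rgb):
    """Slow 2D detector/segmenter (e.g. Mask R-CNN); returns 2D results."""
    return []

def project_to_3d(detections, pose):
    """Lift 2D detections into the 3D volume using the stored camera pose."""

frames = queue.Queue(maxsize=1)  # keep only the newest frame for the CNN

def annotation_worker():
    while True:
        rgb, pose = frames.get()         # blocks until a frame arrives
        detections = run_cnn(rgb)        # may take hundreds of ms; that is fine
        project_to_3d(detections, pose)  # pose was captured with the frame

threading.Thread(target=annotation_worker, daemon=True).start()

def on_new_rgbd_frame(rgb, depth, pose):
    fuse_frame(rgb, depth, pose)         # runs every frame, interactive rate
    try:
        frames.put_nowait((rgb, pose))   # drop frames the CNN cannot keep up with
    except queue.Full:
        pass
```

Because the reconstruction thread never waits on the worker, the update rate stays independent of the detector's inference time, which is the property the abstract claims.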

Keywords: object detection, CNN, dense 3D reconstruction, distributed virtual reality
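The lightweight data structure for incremental transmission mentioned in the abstract can be illustrated in the same spirit. Below is a minimal sketch, under our own assumptions, of a chunk-based scene store with per-chunk dirty flags: after each reconstruction update, only chunks modified since the last sync are compressed and sent to the remote client. Field names and the use of DEFLATE (zlib) are illustrative; the paper's actual on-the-wire format may differ.

```python
import zlib
from dataclasses import dataclass

@dataclass
class Chunk:
    mesh: bytes = b""    # packed triangles for this spatial block
    labels: bytes = b""  # packed object annotations for this block
    dirty: bool = True   # modified since the last transmission?

class SceneStore:
    def __init__(self):
        self.chunks: dict[tuple[int, int, int], Chunk] = {}

    def update(self, key, mesh, labels):
        """Called by the reconstruction thread whenever a block changes."""
        self.chunks[key] = Chunk(mesh, labels, dirty=True)

    def delta(self):
        """Yield (key, compressed payload) for modified chunks only."""
        for key, chunk in self.chunks.items():
            if chunk.dirty:
                chunk.dirty = False
                yield key, zlib.compress(chunk.mesh + chunk.labels)
```

Sending deltas rather than the whole scene is what makes real-time remote exploration of large (hundreds of m³) reconstructions feasible over a network.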

Electronic supplementary material

Video: 41095_2020_194_MOESM1_ESM.mp4

Publication history

Received: 31 August 2020
Accepted: 06 September 2020
Published: 07 January 2021
Issue date: March 2021

Copyright

© The Author(s) 2020

Acknowledgements

This work was supported solely by Vienna University of Technology.

Rights and permissions

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
