Automatic object annotation in streamed and remotely explored large 3D reconstructions

Benjamin Höller, Annette Mossel, Hannes Kaufmann
Institute of Visual Computing and Human-Centered Technology, Vienna University of Technology, Favoritenstraße 9-11/193/06, A-1040 Vienna, Austria

Abstract

We introduce a novel framework for 3D scene reconstruction with simultaneous object annotation, using a pre-trained 2D convolutional neural network (CNN), incremental data streaming, and remote exploration in a virtual reality setup. It enables versatile integration of any 2D box detection or segmentation network. We integrate new approaches to (i) asynchronously perform dense 3D reconstruction and object annotation at interactive frame rates, (ii) efficiently optimize CNN results in terms of object prediction and spatial accuracy, and (iii) generate computationally efficient colliders in large triangulated 3D reconstructions at run-time for 3D scene interaction. Our method is novel in combining CNNs that have long and varying inference times with live 3D reconstruction from RGB-D camera input. We further propose a lightweight data structure to store the 3D reconstruction data and object annotations, enabling fast incremental data transmission for real-time exploration with a remote client, which has not been presented before. Our framework achieves update rates of 22 fps (SSD MobileNet) and 19 fps (Mask R-CNN) for indoor environments of up to 800 m³. We evaluated the accuracy of 3D object detection. Our work provides a versatile foundation for semantic scene understanding of large streamed 3D reconstructions, while being independent of the CNN's processing time. Source code is available for non-commercial use.
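The key architectural point in the abstract is that CNN inference, whose latency is long and variable, must not block the reconstruction loop. The following is a minimal producer/consumer sketch of that decoupling, not the authors' implementation: all function names are illustrative stand-ins, and the sketch assumes the camera pose is stored with each frame so that late-arriving detections can still be projected into the volume.

```python
import queue
import threading

# Hypothetical stand-ins for the real components (names are ours):
def fuse_frame(rgb, depth, pose):
    """Integrate one RGB-D frame into the dense reconstruction."""

def run_cnn(rgb):
    """Slow 2D detector/segmenter (e.g. Mask R-CNN); returns 2D results."""
    return []

def project_to_3d(detections, pose):
    """Lift 2D detections into the 3D volume using the stored camera pose."""

frames = queue.Queue(maxsize=1)  # keep only the newest frame for the CNN

def annotation_worker():
    while True:
        rgb, pose = frames.get()         # blocks until a frame arrives
        detections = run_cnn(rgb)        # may take hundreds of ms; that is fine
        project_to_3d(detections, pose)  # pose was captured with the frame

threading.Thread(target=annotation_worker, daemon=True).start()

def on_new_rgbd_frame(rgb, depth, pose):
    fuse_frame(rgb, depth, pose)         # runs every frame, interactive rate
    try:
        frames.put_nowait((rgb, pose))   # drop frames the CNN cannot keep up with
    except queue.Full:
        pass
```

Because the reconstruction thread never waits on the worker, the update rate stays independent of the detector's inference time, which is the property the abstract claims.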

Keywords: object detection, CNN, dense 3D reconstruction, distributed virtual reality
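The lightweight data structure for incremental transmission mentioned in the abstract can be illustrated in the same spirit. Below is a minimal sketch, under our own assumptions, of a chunk-based scene store with per-chunk dirty flags: after each reconstruction update, only chunks modified since the last sync are compressed and sent to the remote client. Field names and the use of DEFLATE (zlib) are illustrative; the paper's actual on-the-wire format may differ.

```python
import zlib
from dataclasses import dataclass

@dataclass
class Chunk:
    mesh: bytes = b""    # packed triangles for this spatial block
    labels: bytes = b""  # packed object annotations for this block
    dirty: bool = True   # modified since the last transmission?

class SceneStore:
    def __init__(self):
        self.chunks: dict[tuple[int, int, int], Chunk] = {}

    def update(self, key, mesh, labels):
        """Called by the reconstruction thread whenever a block changes."""
        self.chunks[key] = Chunk(mesh, labels, dirty=True)

    def delta(self):
        """Yield (key, compressed payload) for modified chunks only."""
        for key, chunk in self.chunks.items():
            if chunk.dirty:
                chunk.dirty = False
                yield key, zlib.compress(chunk.mesh + chunk.labels)
```

Sending deltas rather than the whole scene is what makes real-time remote exploration of large (hundreds of m³) reconstructions feasible over a network.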

Electronic supplementary material

Video: 41095_2020_194_MOESM1_ESM.mp4

Publication history

Received: 31 August 2020
Accepted: 06 September 2020
Published: 07 January 2021
Issue date: March 2021

Copyright

© The Author(s) 2020

Acknowledgements

This work was supported solely by Vienna University of Technology.

Rights and permissions

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
