ARM3D: Attention-based relation module for indoor 3D object detection

Yuqing Lan; Yao Duan; Chenyi Liu; Chenyang Zhu; Yueshan Xiong; Hui Huang; Kai Xu

doi:10.1007/s41095-021-0252-6

Computational Visual Media 2022, 8(3): 395-414 https://doi.org/10.1007/s41095-021-0252-6

Research Article | Open Access | Published: 08 March 2022

Yuqing Lan¹, Yao Duan¹, Chenyi Liu¹, Chenyang Zhu¹, Yueshan Xiong¹, Hui Huang², Kai Xu¹

¹ College of Computer, National University of Defense Technology, Changsha 410073, China

² Shenzhen University, Shenzhen 518061, China

Keywords:

scene understanding, attention mechanism, relational reasoning, 3D indoor object detection

Cite this article:

Lan Y, Duan Y, Liu C, et al. ARM3D: Attention-based relation module for indoor 3D object detection. Computational Visual Media, 2022, 8(3): 395-414. https://doi.org/10.1007/s41095-021-0252-6

Abstract

Relation contexts have proven useful for many challenging vision tasks. In 3D object detection, previous methods have used context encoding, graph embedding, or explicit relation reasoning to extract relation contexts. However, noisy or low-quality proposals inevitably introduce redundant relation contexts. In fact, invalid relation contexts usually indicate underlying scene misunderstanding and ambiguity, which may instead reduce performance in complex scenes. Inspired by recent attention mechanisms such as the Transformer, we propose a novel 3D attention-based relation module (ARM3D). It comprises object-aware relation reasoning, which extracts pair-wise relation contexts among qualified proposals, and an attention module, which distributes attention weights over the different relation contexts. In this way, ARM3D can take full advantage of the useful relation contexts and filter out those that are less relevant or even confusing, which mitigates ambiguity in detection. We evaluated the effectiveness of ARM3D by plugging it into several state-of-the-art 3D object detectors, obtaining more accurate and robust detection results. Extensive experiments show the capability and generalization of ARM3D on 3D object detection. Our source code is available at https://github.com/lanlan96/ARM3D.
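
The idea of weighting pair-wise relation contexts by attention can be illustrated with a minimal sketch. This is not the ARM3D architecture: the abstract does not specify how relation features or attention scores are computed, so the feature difference used as a relation feature, the scaled dot-product score, and the function name `attend_relations` are all assumptions made for illustration only. In the actual module these components are presumably learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend_relations(features):
    """Weight each proposal's pair-wise relation contexts by attention.

    features: list of N proposal feature vectors (lists of floats), N >= 2.
    Returns (outputs, weights): outputs[i] is the proposal feature
    concatenated with its attended relation context; weights[i] maps each
    other proposal index j to the attention weight given to relation (i, j).
    """
    n = len(features)
    if n < 2:
        raise ValueError("need at least two proposals to form a relation")
    dim = len(features[0])
    outputs, weights = [], []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # ASSUMPTION: the relation feature is the plain feature difference,
        # a stand-in for whatever learned relation encoding the paper uses.
        relations = [
            [fj - fi for fi, fj in zip(features[i], features[j])]
            for j in others
        ]
        # ASSUMPTION: scaled dot-product scores, as in Transformer attention.
        scores = [dot(features[i], r) / math.sqrt(dim) for r in relations]
        attn = softmax(scores)
        # The context is the attention-weighted sum of the relation features.
        context = [
            sum(w * r[k] for w, r in zip(attn, relations))
            for k in range(dim)
        ]
        outputs.append(list(features[i]) + context)
        weights.append(dict(zip(others, attn)))
    return outputs, weights


if __name__ == "__main__":
    proposals = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    outputs, weights = attend_relations(proposals)
    for i, w in enumerate(weights):
        print(i, w)
```

Because the softmax normalizes over all relations of a proposal, low-scoring relation contexts receive small weights and contribute little to the aggregated context. This loosely mirrors the filtering of less relevant contexts described above.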

About this article

Publication history

Received: 26 July 2021

Accepted: 25 August 2021

Published: 08 March 2022

Issue date: September 2022

Acknowledgements

We thank Jiazhao Zhang for server management. This work was supported in part by the National Natural Science Foundation of China (62132021, 62102435, 62002375, 62002376), the National Key R&D Program of China (2018AAA0102200), and NUDT Research Grants (ZK19-30).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.