EFECL: Feature encoding enhancement with contrastive learning for indoor 3D object detection

Yao Duan1, Renjiao Yi1, Yuanming Gao1, Kai Xu1, Chenyang Zhu1 (corresponding author)

1 School of Computing, National University of Defense Technology, Changsha 410000, China

Abstract

Good initial proposals are critical for 3D object detection applications. However, due to the significant geometric variation of indoor scenes, incomplete and noisy proposals are inevitable in most cases. Mining feature information from these "bad" proposals may mislead detection. Contrastive learning provides a feasible way to represent proposals by aligning complete and incomplete/noisy proposals in feature space. The aligned feature space helps build robust 3D representations even when bad proposals are given. Therefore, we devise a new contrastive learning framework for indoor 3D object detection, called EFECL, which learns robust 3D representations by contrasting proposals at two different levels. Specifically, we optimize both instance-level and category-level contrasts to align features by capturing instance-specific characteristics and semantic-aware common patterns. Furthermore, we propose an enhanced feature aggregation module to extract more general and informative features for contrastive learning. Evaluations on the ScanNet V2 and SUN RGB-D benchmarks demonstrate the generalizability and effectiveness of our method, which achieves improvements of 12.3% and 7.3% on the two datasets over benchmark alternatives. The code and models are publicly available at https://github.com/YaraDuan/EFECL.
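The abstract names two contrastive objectives over proposal features: an instance-level contrast that aligns each incomplete/noisy proposal with its complete counterpart, and a category-level contrast that additionally pulls together proposals sharing a semantic class. As a rough illustration, below is a minimal PyTorch sketch of such a two-level InfoNCE-style loss. The function name two_level_contrast, the temperature tau, and the weight alpha are hypothetical stand-ins, not the authors' actual formulation; see the released code for the exact losses.

    # Minimal sketch of a two-level contrastive objective (illustrative only,
    # not the authors' released EFECL implementation).
    import torch
    import torch.nn.functional as F

    def two_level_contrast(feat_noisy, feat_full, labels, tau=0.07, alpha=1.0):
        # feat_noisy: (B, D) features of incomplete/noisy proposals.
        # feat_full:  (B, D) features of the matching complete proposals.
        # labels:     (B,)   semantic category of each proposal.
        feat_noisy = F.normalize(feat_noisy, dim=-1)
        feat_full = F.normalize(feat_full, dim=-1)
        logits = feat_noisy @ feat_full.t() / tau   # (B, B) scaled cosine similarities

        # Instance-level contrast: each noisy proposal should match only its
        # own complete counterpart (the diagonal) and repel all others.
        targets = torch.arange(feat_noisy.size(0), device=logits.device)
        loss_inst = F.cross_entropy(logits, targets)

        # Category-level contrast: every same-category proposal counts as a
        # positive, so average the log-probability over same-class columns.
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()  # (B, B)
        log_prob = F.log_softmax(logits, dim=1)
        loss_cat = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1.0)
        loss_cat = loss_cat.mean()

        return loss_inst + alpha * loss_cat

    # Example usage with random stand-in features (18 is the number of
    # ScanNet V2 detection classes):
    B, D = 32, 128
    loss = two_level_contrast(torch.randn(B, D), torch.randn(B, D),
                              torch.randint(0, 18, (B,)))

In this sketch the instance term uses the diagonal of the similarity matrix as the only positives, while the category term treats every same-class pair as a positive, mirroring the instance-specific versus semantic-aware distinction drawn in the abstract.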

Keywords: object detection, indoor scene, contrastive learning, feature enhancement


Publication history

Received: 07 April 2023
Accepted: 02 July 2023
Published: 03 August 2023
Issue date: December 2023

Copyright

© The Author(s) 2023.

Acknowledgements

We thank Yuqing Lan for visualizing the results. This work was supported in part by the National Key R&D Program of China (2018AAA0102200), the National Natural Science Foundation of China (62002375, 62002376, 62132021), the Natural Science Foundation of Hunan Province of China (2021RC3071, 2022RC1104, 2021JJ40696), and NUDT Research Grants (ZK22-52).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.