The use of pretrained backbones with fine-tuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We design a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrain a large Swin3D model on the synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks, with +2.3 mIoU and +2.2 mIoU on S3DIS Area 5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validates the scalability, generality, and superior performance enabled by our approach.
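To illustrate why window-restricted self-attention on sparse voxels scales linearly in memory, consider the following minimal sketch. It is not the paper's implementation: the function names are ours, and it omits multi-head projections, shifted windows, and the contextual relative positional embedding. It only shows the core idea that attention is computed within fixed-size windows of occupied voxels, so for N occupied voxels and window size W the cost is O(N·W²) rather than the O(N²) of global attention.

```python
import numpy as np

def window_partition(coords, window_size):
    """Group indices of occupied (sparse) voxels by the window that
    contains them. Each voxel attends only to voxels in its window."""
    windows = {}
    for i, c in enumerate(coords):
        key = tuple(c // window_size)  # integer window coordinates
        windows.setdefault(key, []).append(i)
    return windows

def window_self_attention(feats, coords, window_size):
    """Single-head, unprojected self-attention restricted to windows.
    feats: (N, d) voxel features; coords: (N, 3) integer voxel indices."""
    out = np.zeros_like(feats)
    for idx in window_partition(coords, window_size).values():
        f = feats[idx]                                  # (m, d), m <= W^3
        scores = f @ f.T / np.sqrt(f.shape[1])          # scaled dot product
        scores -= scores.max(axis=1, keepdims=True)     # softmax stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)
        out[idx] = attn @ f                             # weighted average
    return out
```

A voxel alone in its window simply attends to itself and passes its feature through unchanged; in the full backbone, shifted windows across layers let information propagate between neighboring windows.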
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.