Recovering dense and uniformly distributed point clouds from sparse or noisy data remains a significant challenge. Recently, great progress has been made on these tasks, but usually at the cost of increasingly intricate modules or complicated network architectures, leading to long inference time and huge resource consumption. Instead, we embrace simplicity and present a simple yet efficient method for jointly upsampling and cleaning point clouds. Our method leverages an off-the-shelf octree-based 3D U-Net (OUNet) with minor modifications, enabling both upsampling and cleaning within a single network. Our network directly processes each input point cloud as a whole instead of processing point cloud patches as in previous works, which significantly eases the implementation and brings at least 47 times faster inferencing. Extensive experiments demonstrate that our method achieves state-of-the-art performance with huge efficiency advantages on a series of benchmarks. We expect our method to serve as a simple baseline and inspire researchers to rethink method designs for point cloud upsampling and cleaning. Our code and trained models are available at https://github.com/octree-nn/upsample-clean.
- Article type
- Year
- Co-author
Open Access
Research Article
Issue
Open Access
Research Article
Issue
The use of pretrained backbones with fine-tuning has shown success for 2D vision and natural language processing tasks, with advantages over task-specific networks. In this paper, we introduce a pretrained 3D backbone, called Swin3D, for 3D indoor scene understanding. We designed a 3D Swin Transformer as our backbone network, which enables efficient self-attention on sparse voxels with linear memory complexity, making the backbone scalable to large models and datasets. We also introduce a generalized contextual relative positional embedding scheme to capture various irregularities of point signals for improved network performance. We pretrained a large Swin3D model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset. Our model pretrained on the synthetic dataset not only generalizes well to downstream segmentation and detection on real 3D point datasets but also outperforms state-of-the-art methods on downstream tasks with +2.3 mIoU and +2.2 mIoU on S3DIS Area5 and 6-fold semantic segmentation, respectively, +1.8 mIoU on ScanNet segmentation (val), +1.9 mAP@0.5 on ScanNet detection, and +8.1 mAP@0.5 on S3DIS detection. A series of extensive ablation studies further validated the scalability, generality, and superior performance enabled by our approach.
京公网安备11010802044758号