This paper presents a novel approach to camera pose refinement based on neural radiance fields (NeRF), introducing semantic feature consistency to enhance robustness. NeRF has been successfully applied to camera pose estimation by inverting the rendering process given an observed RGB image and an initial pose estimate. However, previous methods adopt only photometric consistency for pose optimization, which is prone to becoming trapped in local minima. To address this problem, we introduce semantic feature consistency into the existing framework. Specifically, we utilize high-level features extracted from a convolutional neural network (CNN) pre-trained for image recognition, and maintain consistency of such features between observed and rendered images during the optimization procedure. Unlike per-pixel color values, these features carry rich semantic information shared within local regions and are more robust to appearance changes across viewpoints. Since rendering a full image with NeRF for CNN feature extraction is computationally expensive, we propose an efficient way to estimate the features of individually rendered pixels by projecting them onto a nearby reference image and interpolating its feature maps. Extensive experiments show that our method greatly outperforms the baseline on both synthetic objects and large real-world indoor scenes, increasing pose estimation accuracy by over 6.4%.
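The two ingredients of the objective described above — a photometric term plus a semantic feature term obtained by bilinearly interpolating a reference image's feature map at projected pixel locations — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names (`bilinear_sample`, `pose_loss`) and the weighting term `lam` are assumptions introduced for clarity.

```python
import numpy as np

def bilinear_sample(feat, xy):
    """Bilinearly interpolate a feature map feat (H, W, C) at continuous
    pixel coordinates xy (N, 2), given as (x, y) pairs."""
    H, W, _ = feat.shape
    x = np.clip(xy[:, 0], 0, W - 1)
    y = np.clip(xy[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx = (x - x0)[:, None]
    wy = (y - y0)[:, None]
    # Weighted combination of the four surrounding feature vectors.
    return ((1 - wy) * ((1 - wx) * feat[y0, x0] + wx * feat[y0, x1])
            + wy * ((1 - wx) * feat[y1, x0] + wx * feat[y1, x1]))

def pose_loss(rendered_rgb, observed_rgb, proj_xy, ref_feat,
              obs_feat_at_pix, lam=0.1):
    """Combined photometric + semantic feature consistency loss for a batch
    of sampled pixels.  rendered_rgb / observed_rgb: (N, 3) colors;
    proj_xy: (N, 2) projections of the rendered points into a nearby
    reference image; ref_feat: (H, W, C) CNN feature map of that reference
    image; obs_feat_at_pix: (N, C) CNN features of the observed image at
    the sampled pixels.  lam is an assumed balancing weight."""
    photo = np.mean(np.sum((rendered_rgb - observed_rgb) ** 2, axis=-1))
    # Cheap per-pixel feature estimate: interpolate the reference feature
    # map instead of rendering a full image and running the CNN on it.
    feat_r = bilinear_sample(ref_feat, proj_xy)
    sem = np.mean(np.sum((feat_r - obs_feat_at_pix) ** 2, axis=-1))
    return photo + lam * sem
```

In a full pipeline both terms would be differentiated with respect to the pose parameters by an autodiff framework; the numpy version here only shows how the per-pixel feature estimate avoids rendering a full image.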

In the domain of point cloud registration, the coarse-to-fine feature matching paradigm has received significant attention due to its impressive performance. This paradigm involves a two-step process: first, the extraction of multi-level features, and subsequently, the propagation of correspondences from coarse to fine levels. However, this approach faces two notable limitations. First, the Dual Softmax operation promotes one-to-one correspondences between superpoints, inadvertently excluding other valuable correspondences. Second, it is crucial to closely examine the overlapping regions between point clouds, as only correspondences within these regions determine the actual transformation. Considering these issues, we propose OAAFormer to enhance correspondence quality. On the one hand, we introduce a soft matching mechanism to propagate potentially valuable correspondences from coarse to fine levels. On the other hand, we integrate an overlapping-region detection module to minimize mismatches. Furthermore, we introduce a region-wise attention module with linear complexity during the fine-level matching phase, designed to enhance the discriminative power of the extracted features. Tests on the challenging 3DLoMatch benchmark demonstrate that our approach yields a substantial increase of about 7% in inlier ratio and an improvement of 2%–4% in registration recall. Finally, to accelerate prediction, we replace the conventional Random Sample Consensus (RANSAC) algorithm with the selection of a limited yet representative set of high-confidence correspondences, achieving a 100× speedup while maintaining comparable registration performance.
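To make the contrast concrete, the sketch below shows the standard Dual Softmax scoring and a soft matching rule that keeps the top-k correspondences per superpoint above a confidence threshold, rather than only mutual best matches. This is an illustrative simplification under assumed names (`dual_softmax`, `soft_matches`) and parameters (`k`, `thresh`), not OAAFormer's actual mechanism.

```python
import numpy as np

def dual_softmax(sim):
    """Dual Softmax: entrywise product of a row-wise and a column-wise
    softmax over the similarity matrix sim (M, N).  An entry scores high
    only if it dominates both its row and its column, which tends to
    enforce one-to-one superpoint correspondences."""
    er = np.exp(sim - sim.max(axis=1, keepdims=True))
    ec = np.exp(sim - sim.max(axis=0, keepdims=True))
    return (er / er.sum(axis=1, keepdims=True)) * \
           (ec / ec.sum(axis=0, keepdims=True))

def soft_matches(sim, k=3, thresh=0.05):
    """Soft matching: for each source superpoint, keep up to k target
    candidates whose dual-softmax confidence clears thresh, so potentially
    valuable one-to-many correspondences survive to the fine level."""
    conf = dual_softmax(sim)
    matches = []
    for i in range(conf.shape[0]):
        for j in np.argsort(conf[i])[::-1][:k]:
            if conf[i, j] >= thresh:
                matches.append((i, int(j), float(conf[i, j])))
    return matches
```

Taking hard mutual argmaxes of `dual_softmax` recovers the one-to-one behavior the abstract criticizes; `soft_matches` instead passes a small candidate set per superpoint downstream, where overlap detection and fine-level matching can prune mismatches.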