Autonomous, high-precision localization is critical for unmanned aerial vehicles (UAVs) to complete tasks safely. The global navigation satellite system (GNSS) is the mainstream localization technology, but its signals are blocked in urban canyons and mountainous areas and are vulnerable to jamming and spoofing, which severely degrades localization accuracy. An inertial navigation system (INS) works independently of external signals but has its own drawbacks: high-precision INS is costly, while low-cost micro-electro-mechanical system (MEMS) INS suffers from drift that accumulates over time. By contrast, absolute visual localization, which matches real-time UAV images against geo-tagged reference images, is drift-free and immune to electromagnetic interference. Traditional visual localization methods were designed for nadir observation at high altitude (above 500 m). Under these conditions, the ground scene can be approximated as a 2D plane, and the geometric relationship between UAV images and reference images reduces mainly to scaling, rotation, and translation, which can be described by a similarity transformation or a homography. Consequently, image-matching-based UAV localization algorithms have achieved high accuracy and strong robustness for high-altitude nadir imaging. UAV operations are now shifting toward low-altitude, fine-grained tasks. Small commercial UAVs are usually restricted to altitudes below 500 m and often require oblique observation to capture side-view information. This produces pronounced 3D effects and perspective distortion, which invalidate the planar assumption and cause large viewpoint differences between UAV images and reference images. Low-altitude imaging also has a small field of view and rapid scale changes, further complicating localization. Traditional visual localization methods therefore struggle to meet high-precision requirements under low-altitude oblique conditions.
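For the high-altitude nadir case described above, the similarity model can be illustrated with a minimal sketch. It assumes exactly two error-free point correspondences between a UAV image and a reference image and writes 2D points as complex numbers, so that the transform becomes z -> a*z + b with a = scale * exp(i * rotation). The function names and the pixel values are hypothetical and chosen only for illustration; practical systems would fit the model to many noisy matches with a robust estimator.

```python
import cmath
import math


def estimate_similarity(src_a, src_b, dst_a, dst_b):
    """Estimate a 2D similarity transform from two point correspondences.

    Points are (x, y) tuples. Returns complex (a, b) such that a point z in
    the source image maps to a*z + b in the destination image.
    """
    z1, z2 = complex(*src_a), complex(*src_b)
    w1, w2 = complex(*dst_a), complex(*dst_b)
    if z1 == z2:
        raise ValueError("source points must be distinct")
    a = (w2 - w1) / (z2 - z1)  # encodes scale and rotation
    b = w1 - a * z1            # encodes translation
    return a, b


def apply_similarity(a, b, point):
    """Map an (x, y) point through the similarity transform."""
    w = a * complex(*point) + b
    return (w.real, w.imag)


def decompose(a, b):
    """Split (a, b) into scale, rotation in degrees, and translation."""
    return abs(a), math.degrees(cmath.phase(a)), (b.real, b.imag)


# Hypothetical ground-truth transform used to synthesize the correspondences.
true_a = 0.5 * cmath.exp(1j * math.radians(30.0))
true_b = complex(100.0, 200.0)

uav_points = [(0.0, 0.0), (640.0, 480.0)]
ref_points = [apply_similarity(true_a, true_b, p) for p in uav_points]

a, b = estimate_similarity(uav_points[0], uav_points[1],
                           ref_points[0], ref_points[1])
scale, rotation_deg, translation = decompose(a, b)

# Locate the UAV image centre in the reference image.
mapped_centre = apply_similarity(a, b, (320.0, 240.0))
```

Under the planar assumption, mapping the UAV image centre into the geo-tagged reference image is what yields a position estimate. Once the scene has significant relief or the view is oblique, no single (a, b) fits all correspondences, which is the failure mode the paragraph above describes.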
This survey focuses on visual localization for low-altitude UAVs and centers on the hierarchical "retrieval-matching-pose estimation" framework, which addresses large viewpoint differences, rapid scale variations, and object occlusions through a coarse-to-fine strategy. Compared with alternatives such as relative visual localization (which accumulates error), end-to-end direct localization (which generalizes poorly), and map-free localization (which is scene-dependent), this hierarchical framework balances search efficiency, positioning accuracy, and scene generalization, making it a robust technical route for long-endurance absolute localization at low altitude. The survey systematically reviews the development and current status of the framework's three core modules. For cross-view image retrieval, methods have evolved from handcrafted features to deep learning. Early methods include template matching with similarity metrics (e.g., SAD, SSD, NCC) and local feature aggregation (e.g., BoW and VLAD), but these struggle with large viewpoint differences. Recent deep learning methods improve cross-domain generalization through feature map reorganization (e.g., annular or dense partitioning) and optimized training strategies (e.g., contrastive learning with the InfoNCE loss, self-supervised adaptation). For cross-view image matching, deep learning models have gradually replaced handcrafted feature methods (e.g., SIFT, SURF, ORB). Existing matching networks fall into sparse, semi-dense, and dense types: sparse methods (e.g., SuperPoint+LightGlue) prioritize computational efficiency, while dense methods (e.g., RoMa) achieve higher matching accuracy. For UAV pose estimation, classic perspective-n-point (PnP) algorithms and their variants are widely used. Improved methods adapted to UAV scenarios integrate IMU prior information to reduce the degrees of freedom of the problem, or use the RANSAC algorithm to filter out mismatched points, improving stability under low-altitude observation conditions. In addition, this survey summarizes representative datasets for cross-view image retrieval and matching in low-altitude visual localization, such as University-1652 (simulated data) and AnyVisLoc (real-scene multi-view data), and analyzes the performance of existing methods on edge computing platforms (e.g., the NVIDIA Jetson series). The results show that most methods achieve meter-level accuracy but face challenges in real-time inference and hardware cost control.
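To illustrate how an IMU prior can reduce the degrees of freedom of pose estimation, the following minimal sketch assumes that the full camera rotation R (world to camera) is known exactly from the IMU, so that only the translation t remains unknown. Each 2D-3D correspondence then gives two equations that are linear in t, which are solved here by least squares. This is one generic formulation, not the specific method of any surveyed paper. The function names are hypothetical, image points are assumed to be in normalized coordinates (intrinsics already removed), and the 3D points are assumed to come from a georeferenced reference with elevation.

```python
def solve_3x3(m, v):
    """Solve the 3x3 linear system m x = v with Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    if abs(d) < 1e-12:
        raise ValueError("degenerate point configuration")
    solution = []
    for col in range(3):
        # Replace one column of m by v, as Cramer's rule requires.
        mc = [[v[r] if c == col else m[r][c] for c in range(3)]
              for r in range(3)]
        solution.append(det(mc) / d)
    return solution


def rotate(R, X):
    """Apply the 3x3 rotation matrix R to the 3-vector X."""
    return [sum(R[r][c] * X[c] for c in range(3)) for r in range(3)]


def translation_from_known_rotation(R, world_points, image_points):
    """Least-squares t such that (u, v) is the projection of R X + t."""
    if len(world_points) < 2:
        raise ValueError("at least two correspondences are required")
    rows, rhs = [], []
    for X, (u, v) in zip(world_points, image_points):
        Y = rotate(R, X)
        # u = (Y0 + t0) / (Y2 + t2)  ->  t0 - u * t2 = u * Y2 - Y0
        rows.append([1.0, 0.0, -u])
        rhs.append(u * Y[2] - Y[0])
        # v = (Y1 + t1) / (Y2 + t2)  ->  t1 - v * t2 = v * Y2 - Y1
        rows.append([0.0, 1.0, -v])
        rhs.append(v * Y[2] - Y[1])
    # Normal equations (A^T A) t = A^T b.
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           for i in range(3)]
    Atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve_3x3(AtA, Atb)


def camera_position(R, t):
    """Camera centre in world coordinates: C = -R^T t."""
    return [-sum(R[r][c] * t[r] for r in range(3)) for c in range(3)]


# Synthetic check: nadir-looking camera (world z up, camera z down).
R = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
true_position = [10.0, 20.0, 120.0]
true_t = [-c for c in rotate(R, true_position)]

world_points = [[0.0, 0.0, 0.0], [50.0, 10.0, 5.0],
                [-30.0, 40.0, 0.0], [20.0, -25.0, 12.0]]
image_points = []
for X in world_points:
    Y = rotate(R, X)
    p = [Y[i] + true_t[i] for i in range(3)]
    image_points.append((p[0] / p[2], p[1] / p[2]))

t_est = translation_from_known_rotation(R, world_points, image_points)
position_est = camera_position(R, t_est)
```

With the rotation fixed, the six-degree-of-freedom problem shrinks to three unknowns and needs only two correspondences, whereas general PnP solvers need at least three. In practice the IMU attitude is noisy and the matches contain outliers, so a formulation like this would typically sit inside a RANSAC loop and be followed by nonlinear refinement.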
The "retrieval-matching-pose estimation" framework is a reliable technical route for absolute visual localization of low-altitude UAVs, balancing search efficiency, positioning accuracy, and generalization. Current methods are still limited in cross-domain generalization, real-time inference on edge platforms, and robustness in complex environments. Future research should focus on lightweight model design for edge deployment, self-supervised learning to reduce data dependency, the construction of high-quality datasets, and multi-source information fusion to improve system reliability. This survey is intended as a reference for both academic research and engineering applications of absolute visual localization for low-altitude UAVs.