Discover the SciOpen Platform and Achieve Your Research Goals with Ease.
Search articles, authors, keywords, DOl and etc.
Accurate short-horizon trajectory prediction is crucial for safe and reliable autonomous driving. However, existing vision language models (VLMs) often fail to accurately understand driving scenes and generate trustworthy trajectories. To address this challenge, this study introduces KEPT, a knowledge-enhanced VLM framework that predicts ego trajectories directly from consecutive front-view driving frames. KEPT integrates a temporal frequency–spatial fusion (TFSF) video encoder, which is trained via self-supervised learning with hard-negative mining, with a k-means & HNSW retrieval-augmented generation (RAG) pipeline. Retrieved prior knowledge is added into chain-of-thought (CoT) prompts with explicit planning constraints, while a triple-stage fine-tuning paradigm aligns the VLM backbone to enhance spatial perception and trajectory prediction capabilities. Evaluated on nuScenes dataset, KEPT achieves the best open-loop performance compared with baseline methods. Ablation studies on fine-tuning stages, Top-K value of RAG, different retrieval strategies, vision encoders, and VLM backbones are conducted to demonstrate the effectiveness of KEPT. These results indicate that KEPT offers a promising, data-efficient way toward trustworthy trajectory prediction in autonomous driving.

This is an open access article under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0 http://creativecommons.org/licenses/by/4.0/).
Comments on this article