Research Article | Open Access

KEPT: Knowledge-enhanced prediction of trajectories from consecutive driving frames with vision-language models

Yujin Wang1, Tianyi Wang2, Quanfeng Liu1, Wenxian Fan1, Junfeng Jiao3, Christian Claudel2, Yunbing Yan4, Bingzhao Gao1 (corresponding author), Jianqiang Wang5, Hong Chen6
1 College of Automotive and Energy Engineering, Tongji University, Shanghai 201804, China
2 Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin, Austin, Texas 78712, USA
3 School of Architecture, The University of Texas at Austin, Austin, Texas 78712, USA
4 School of Automotive and Traffic Engineering, Wuhan University of Science and Technology, Wuhan 430081, China
5 School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
6 College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China

Abstract

Accurate short-horizon trajectory prediction is crucial for safe and reliable autonomous driving. However, existing vision-language models (VLMs) often fail to accurately understand driving scenes and generate trustworthy trajectories. To address this challenge, this study introduces KEPT, a knowledge-enhanced VLM framework that predicts ego trajectories directly from consecutive front-view driving frames. KEPT integrates a temporal frequency–spatial fusion (TFSF) video encoder, trained via self-supervised learning with hard-negative mining, with a k-means & HNSW retrieval-augmented generation (RAG) pipeline. Retrieved prior knowledge is injected into chain-of-thought (CoT) prompts with explicit planning constraints, while a triple-stage fine-tuning paradigm aligns the VLM backbone to enhance spatial perception and trajectory prediction capabilities. Evaluated on the nuScenes dataset, KEPT achieves the best open-loop performance compared with baseline methods. Ablation studies on the fine-tuning stages, the Top-K value for RAG, different retrieval strategies, vision encoders, and VLM backbones demonstrate the effectiveness of KEPT. These results indicate that KEPT offers a promising, data-efficient path toward trustworthy trajectory prediction in autonomous driving.
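The abstract's retrieval stage (k-means clustering combined with HNSW approximate nearest-neighbor search to fetch Top-K prior-knowledge entries) can be sketched at a high level. The following is an illustrative coarse-to-fine retrieval in plain NumPy, not the authors' implementation: embeddings are clustered with k-means, only the nearest clusters are probed, and an exact within-cluster search stands in for the HNSW graph search used in the paper. All function names, dimensions, and parameter values here are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal Lloyd's k-means over the rows of X (the knowledge-base embeddings)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each embedding to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # recompute each non-empty cluster's center
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

def retrieve_top_k(query, X, centers, labels, top_k=3, n_probe=2):
    """Coarse step: pick the n_probe clusters whose centers are closest to the
    query embedding. Fine step: exact nearest-neighbor search within those
    clusters (a stand-in for HNSW). Returns indices into X."""
    probe = np.argsort(np.linalg.norm(centers - query, axis=1))[:n_probe]
    cand = np.where(np.isin(labels, probe))[0]
    order = np.argsort(np.linalg.norm(X[cand] - query, axis=1))[:top_k]
    return cand[order]
```

In KEPT, the retrieved entries would correspond to prior driving-scene knowledge whose text is then inserted into the chain-of-thought prompt; here the embeddings are random placeholders and the index structure is deliberately simplified.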

Graphical Abstract

Communications in Transportation Research
Article number: 9640012

Cite this article:
Wang Y, Wang T, Liu Q, et al. KEPT: Knowledge-enhanced prediction of trajectories from consecutive driving frames with vision-language models. Communications in Transportation Research, 2026, 6(1): 9640012. https://doi.org/10.26599/COMMTR.2026.9640012


Received: 17 October 2025
Revised: 25 November 2025
Accepted: 12 January 2026
Published: 31 March 2026
© The Author(s) 2026.

This is an open access article under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0, http://creativecommons.org/licenses/by/4.0/).