KEPT: Knowledge-enhanced prediction of trajectories from consecutive driving frames with vision-language models

Yujin Wang; Tianyi Wang; Quanfeng Liu; Wenxian Fan; Junfeng Jiao; Christian Claudel; Yunbing Yan; Bingzhao Gao; Jianqiang Wang; Hong Chen

doi:10.26599/COMMTR.2026.9640012

Research Article | Open Access

Yujin Wang¹, Tianyi Wang², Quanfeng Liu¹, Wenxian Fan¹, Junfeng Jiao³, Christian Claudel², Yunbing Yan⁴, Bingzhao Gao¹, Jianqiang Wang⁵, Hong Chen⁶

¹ College of Automotive and Energy Engineering, Tongji University, Shanghai 201804, China

² Department of Civil, Architectural, and Environmental Engineering, The University of Texas at Austin, Austin, Texas 78712, USA

³ School of Architecture, The University of Texas at Austin, Austin, Texas 78712, USA

⁴ School of Automotive and Traffic Engineering, Wuhan University of Science and Technology, Wuhan 430081, China

⁵ School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China

⁶ College of Electronic and Information Engineering, Tongji University, Shanghai 201804, China

Abstract

Accurate short-horizon trajectory prediction is crucial for safe and reliable autonomous driving. However, existing vision-language models (VLMs) often fail to understand driving scenes accurately and to generate trustworthy trajectories. To address this challenge, this study introduces KEPT, a knowledge-enhanced VLM framework that predicts ego trajectories directly from consecutive front-view driving frames. KEPT integrates a temporal frequency–spatial fusion (TFSF) video encoder, trained via self-supervised learning with hard-negative mining, with a k-means & HNSW retrieval-augmented generation (RAG) pipeline. Retrieved prior knowledge is injected into chain-of-thought (CoT) prompts with explicit planning constraints, while a triple-stage fine-tuning paradigm aligns the VLM backbone to enhance its spatial perception and trajectory prediction capabilities. Evaluated on the nuScenes dataset, KEPT achieves the best open-loop performance among the compared baseline methods. Ablation studies on the fine-tuning stages, the Top-K value used in RAG, retrieval strategies, vision encoders, and VLM backbones demonstrate the effectiveness of KEPT. These results indicate that KEPT offers a promising, data-efficient path toward trustworthy trajectory prediction in autonomous driving.
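The retrieval step named in the abstract (coarse k-means clustering followed by a nearest-neighbor search that returns the Top-K most similar scene embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, embedding sizes, and parameters below are hypothetical, and an exact cosine-similarity search inside the selected cluster stands in for the HNSW approximate index so that the example depends only on NumPy.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns centroids and the final cluster label per row."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(0)
    # Recompute labels so they agree with the final centroids.
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return centroids, dist.argmin(1)

def retrieve(query, x, centroids, labels, top_k=3, n_probe=1):
    """Return indices of up to top_k database rows most similar to the query.

    Stage 1: pick the n_probe clusters whose centroids are closest to the query.
    Stage 2: rank only the members of those clusters by cosine similarity
             (exact search here; an HNSW index would approximate this stage).
    """
    q = query / np.linalg.norm(query)
    dist = ((q[None, :] - centroids) ** 2).sum(-1)
    probe = np.argsort(dist)[:n_probe]
    candidates = np.flatnonzero(np.isin(labels, probe))
    sims = x[candidates] @ q
    order = np.argsort(-sims)[:top_k]
    return candidates[order]

# Hypothetical usage: 200 random 16-dimensional "scene embeddings".
rng = np.random.default_rng(42)
database = l2_normalize(rng.normal(size=(200, 16)))
centroids, labels = kmeans(database, k=4)
neighbors = retrieve(rng.normal(size=16), database, centroids, labels, top_k=3)
print(neighbors)
```

In a pipeline of the kind the abstract describes, the indices returned by such a search would select stored prior scenes whose information is then written into the CoT prompt; how KEPT formats that prompt is not specified on this page.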

Keywords

autonomous driving; trajectory prediction; vision-language model; retrieval-augmented generation; chain-of-thought prompt

Communications in Transportation Research

Volume 6, Issue 1, March 2026

Article number: 9640012

DOI: 10.26599/COMMTR.2026.9640012

Cite this article:

Wang Y, Wang T, Liu Q, et al. KEPT: Knowledge-enhanced prediction of trajectories from consecutive driving frames with vision-language models. Communications in Transportation Research, 2026, 6(1): 9640012. https://doi.org/10.26599/COMMTR.2026.9640012

Received: 17 October 2025

Revised: 25 November 2025

Accepted: 12 January 2026

Published: 31 March 2026

This is an open access article under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0; http://creativecommons.org/licenses/by/4.0/).