Research Article | Open Access

VLMPed-CoT: A large vision-language model with a chain-of-thought mechanism for pedestrian crossing intention prediction

Yancheng Ling (1), Zhenlin Qin (1), Leizhen Wang (2), Zhendong Liu (3), Yang Liu (4), Zhenliang Ma (1, 5)
1. Department of Civil and Architectural Engineering, KTH Royal Institute of Technology, Stockholm 11428, Sweden
2. Department of Data Science and Artificial Intelligence, Monash University, Clayton 3800, Australia
3. Department of Engineering Mechanics, KTH Royal Institute of Technology, Stockholm 11428, Sweden
4. State Key Laboratory of Intelligent Green Vehicle and Mobility, Tsinghua University, Beijing 100084, China
5. Digital Futures, KTH Royal Institute of Technology, Stockholm 10044, Sweden

Abstract

Pedestrian crossing intention prediction is crucial for autonomous driving. While existing models have achieved high accuracy, their generalization and robustness remain limited, which hinders their performance in real-world scenarios. To overcome these limitations, we introduce LVLMPed-CoT, a large vision-language model (LVLM) that incorporates a chain-of-thought (CoT) mechanism to enhance pedestrian crossing intention prediction. It takes multimodal data as input and employs data distillation along with a two-stage fine-tuning strategy to elicit the implicit CoT capability of a lightweight vision-language model for enhanced perception, reasoning, and prediction. The unified LVLMPed-CoT is trained on a joint open-source dataset (JAAD and PIE) and achieves performance superior or comparable to that of state-of-the-art models on both large-scale public datasets. The ablation study validates the contribution of the CoT prompt design and the two-stage fine-tuning strategy to the model's performance. Further analysis investigates the impact of input data sequence length and image quality on both accuracy and inference time, as well as the interpretability of the enhanced CoT reasoning ability achieved through fine-tuning.

Communications in Transportation Research
Article number: 9640009

Cite this article:
Ling Y, Qin Z, Wang L, et al. VLMPed-CoT: A large vision-language model with a chain-of-thought mechanism for pedestrian crossing intention prediction. Communications in Transportation Research, 2026, 6(1): 9640009. https://doi.org/10.26599/COMMTR.2026.9640009

Views: 1317
Downloads: 199
Citations: Crossref 0, Web of Science 0, Scopus 0

Received: 15 October 2025
Revised: 13 December 2025
Accepted: 04 January 2026
Published: 31 March 2026
© The Author(s) 2026.

This is an open access article under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0, http://creativecommons.org/licenses/by/4.0/).