VLMPed-CoT: A large vision-language model with a chain-of-thought mechanism for pedestrian crossing intention prediction

Yancheng Ling; Zhenlin Qin; Leizhen Wang; Zhendong Liu; Yang Liu; Zhenliang Ma

doi:10.26599/COMMTR.2026.9640009

Research Article | Open Access

Yancheng Ling¹, Zhenlin Qin¹, Leizhen Wang², Zhendong Liu³, Yang Liu⁴, Zhenliang Ma¹,⁵

¹ Department of Civil and Architectural Engineering, KTH Royal Institute of Technology, Stockholm 11428, Sweden

² Department of Data Science and Artificial Intelligence, Monash University, Clayton 3800, Australia

³ Department of Engineering Mechanics, KTH Royal Institute of Technology, Stockholm 11428, Sweden

⁴ State Key Laboratory of Intelligent Green Vehicle and Mobility, Tsinghua University, Beijing 100084, China

⁵ Digital Futures, KTH Royal Institute of Technology, Stockholm 10044, Sweden

Abstract

Pedestrian crossing intention prediction is crucial for autonomous driving. Although existing models achieve high accuracy, their generalization and robustness remain limited, which hinders their performance in real-world scenarios. To overcome these limitations, we introduce LVLMPed-CoT, a large vision-language model (LVLM) that incorporates a chain-of-thought (CoT) mechanism to enhance pedestrian crossing intention prediction. It takes multimodal data as input and employs data distillation together with a two-stage fine-tuning strategy to elicit the implicit CoT capability of a lightweight vision-language model for enhanced perception, reasoning, and prediction. The unified LVLMPed-CoT is trained on a joint open-source dataset (JAAD and PIE) and achieves superior or comparable performance to state-of-the-art models on both large-scale public datasets. An ablation study validates the contribution of the CoT prompt design and the two-stage fine-tuning strategy to the model's performance. Further analysis investigates the impact of input data sequence length and image quality on both accuracy and inference time, as well as the interpretability of the enhanced CoT reasoning ability achieved through fine-tuning.

Keywords

pedestrian crossing intention prediction; large vision-language model (LVLM); chain of thought (CoT); data distillation; two-stage fine-tuning strategy

Communications in Transportation Research

Volume 6, Issue 1, March 2026

Article number: 9640009

DOI: 10.26599/COMMTR.2026.9640009

Cite this article:

Ling Y, Qin Z, Wang L, et al. VLMPed-CoT: A large vision-language model with a chain-of-thought mechanism for pedestrian crossing intention prediction. Communications in Transportation Research, 2026, 6(1): 9640009. https://doi.org/10.26599/COMMTR.2026.9640009

Received: 15 October 2025

Revised: 13 December 2025

Accepted: 04 January 2026

Published: 31 March 2026

This is an open access article under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0, http://creativecommons.org/licenses/by/4.0/).