Pedestrian crossing intention prediction is crucial for autonomous driving. Although existing models achieve high accuracy, their generalization and robustness remain limited, which hinders their performance in real-world scenarios. To overcome these limitations, we introduce LVLMPed-CoT, a large vision-language model (LVLM) that incorporates a chain-of-thought (CoT) mechanism to enhance pedestrian crossing intention prediction. It takes multimodal data as input and employs data distillation together with a two-stage fine-tuning strategy to elicit the implicit CoT capability of a lightweight vision-language model, improving perception, reasoning, and prediction. The unified LVLMPed-CoT is trained on a joint open-source dataset (JAAD and PIE) and achieves performance superior or comparable to that of state-of-the-art models on both of these large-scale public datasets. An ablation study validates the contribution of the CoT prompt design and the two-stage fine-tuning strategy to the model's performance. Further analysis investigates the impact of input sequence length and image quality on both accuracy and inference time, as well as the interpretability of the enhanced CoT reasoning ability achieved through fine-tuning.
This is an open access article under the terms of the Creative Commons Attribution 4.0 International License (CC BY 4.0, http://creativecommons.org/licenses/by/4.0/).