Research Article | Open Access

GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models

Haicheng Liao a,1, Huanming Shen b,1, Zhenning Li c (corresponding author), Chengyue Wang d, Guofa Li e, Yiming Bie f, Chengzhong Xu a (corresponding author)
a. State Key Laboratory of Internet of Things for Smart City and Department of Computer and Information Science, University of Macau, Macau SAR, 999078, China
b. Department of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, 610000, China
c. State Key Laboratory of Internet of Things for Smart City and Departments of Civil and Environmental Engineering and Computer and Information Science, University of Macau, Macau SAR, 999078, China
d. State Key Laboratory of Internet of Things for Smart City and Department of Civil and Environmental Engineering, University of Macau, Macau SAR, 999078, China
e. College of Mechanical and Vehicle Engineering, Chongqing University, Chongqing, 400030, China
f. School of Transportation, Jilin University, Changchun, 130000, China

1 These authors contributed equally to this work.


Abstract

In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces an encoder-decoder framework developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model integrates five core encoders (Text, Emotion, Image, Context, and Cross-Modal) with a multimodal decoder. This integration enables CAVG to capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs) including GPT-4. The architecture is reinforced by multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation. This design enables the model to efficiently process and interpret a range of cross-modal inputs, yielding a comprehensive understanding of the correlation between verbal commands and the corresponding visual scenes. Empirical evaluations on the Talk2Car dataset, a real-world benchmark, demonstrate that CAVG sets new standards in prediction accuracy and operational efficiency. Notably, the model maintains strong performance even when trained on only 50% to 75% of the full dataset, highlighting its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG shows remarkable robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather, and densely populated urban environments.
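
To make the cross-modal fusion step described above concrete, the sketch below illustrates one plausible reading of the multi-head cross-modal attention stage: candidate region features attend over the embedded command tokens, and each region is then scored as the likely grounding target. This is not the authors' implementation; the class name CrossModalAttentionSketch, the tensor shapes, the feature dimension of 256, and the plain linear scoring head (standing in for the Region-Specific Dynamic layer) are all illustrative assumptions.

```python
# Hypothetical sketch (not the authors' released code): multi-head cross-modal
# attention in which each candidate region attends over the command's token
# embeddings and is then scored as the possible grounding target.
import torch
import torch.nn as nn


class CrossModalAttentionSketch(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Region features act as queries; command token embeddings act as keys/values.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.score = nn.Linear(dim, 1)  # per-region relevance score (simplified stand-in for the RSD layer)

    def forward(self, text_feats: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # text_feats:   (batch, num_tokens, dim)  command embedding, e.g. from a language model
        # region_feats: (batch, num_regions, dim) projected region-proposal features
        fused, _ = self.attn(query=region_feats, key=text_feats, value=text_feats)
        fused = self.norm(fused + region_feats)  # residual fusion per region
        logits = self.score(fused).squeeze(-1)   # (batch, num_regions)
        return logits.softmax(dim=-1)            # probability of each region being the target


if __name__ == "__main__":
    model = CrossModalAttentionSketch()
    text = torch.randn(2, 20, 256)      # a 20-token command
    regions = torch.randn(2, 32, 256)   # 32 candidate regions
    print(model(text, regions).shape)   # torch.Size([2, 32])
```

In the paper's full pipeline, the text and region features would come from the dedicated Text, Emotion, Image, and Context encoders rather than random tensors, and the RSD layer would modulate attention across regions instead of the simple linear score used here.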

Electronic Supplementary Material

ctr-4-2-100116_ESM.pdf (384.1 KB)

Communications in Transportation Research
Article number: 100116
Cite this article:
Liao H, Shen H, Li Z, et al. GPT-4 enhanced multimodal grounding for autonomous driving: Leveraging cross-modal attention with large language models. Communications in Transportation Research, 2024, 4(2): 100116. https://doi.org/10.1016/j.commtr.2023.100116


Received: 15 October 2023
Revised: 21 November 2023
Accepted: 25 November 2023
Published: 21 February 2024
© 2023 The Author(s).

This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
