In the field of autonomous vehicles (AVs), accurately discerning commander intent and executing linguistic commands within a visual context presents a significant challenge. This paper introduces an encoder-decoder framework developed to address visual grounding in AVs. Our Context-Aware Visual Grounding (CAVG) model integrates five core encoders—Text, Emotion, Image, Context, and Cross-Modal—with a multimodal decoder. This integration enables CAVG to capture contextual semantics and to learn human emotional features, augmented by state-of-the-art Large Language Models (LLMs), including GPT-4. The architecture is reinforced by multi-head cross-modal attention mechanisms and a Region-Specific Dynamic (RSD) layer for attention modulation, which allow the model to efficiently process a range of cross-modal inputs and form a comprehensive understanding of the correlation between verbal commands and the corresponding visual scenes. Empirical evaluations on Talk2Car, a real-world benchmark dataset, demonstrate that CAVG sets new standards in prediction accuracy and operational efficiency. Notably, the model performs well even with limited training data, ranging from 50% to 75% of the full dataset, underscoring its effectiveness and potential for deployment in practical AV applications. Moreover, CAVG shows strong robustness and adaptability in challenging scenarios, including long-text command interpretation, low-light conditions, ambiguous command contexts, inclement weather, and densely populated urban environments.
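The abstract names multi-head cross-modal attention as a core mechanism but does not give implementation details. The following is a minimal NumPy sketch of the general idea only — text-token queries attending over image-region keys and values — not the paper's actual layer. The random projection matrices, dimensions, and head count are illustrative stand-ins for learned parameters.

```python
import numpy as np

def multi_head_cross_attention(text_feats, img_feats, num_heads=4, seed=0):
    """Sketch of cross-modal attention: command tokens (queries) attend
    over image-region features (keys/values). Projections are randomly
    initialized here purely for illustration; in a trained model they
    would be learned weights."""
    rng = np.random.default_rng(seed)
    d_model = text_feats.shape[-1]
    assert d_model % num_heads == 0
    d_head = d_model // num_heads

    # One projection per role, later split across heads.
    Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Wv = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

    def split_heads(x):  # (seq, d_model) -> (heads, seq, d_head)
        return x.reshape(x.shape[0], num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(text_feats @ Wq)  # queries from the command text
    k = split_heads(img_feats @ Wk)   # keys from image regions
    v = split_heads(img_feats @ Wv)   # values from image regions

    # Scaled dot-product scores, softmax over the region axis.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, T, R)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v                                    # (heads, T, d_head)
    # Re-merge heads: (heads, T, d_head) -> (T, d_model).
    return out.transpose(1, 0, 2).reshape(text_feats.shape[0], d_model)

# Example: 6 command tokens attending over 10 candidate image regions.
text = np.random.default_rng(1).standard_normal((6, 64))
regions = np.random.default_rng(2).standard_normal((10, 64))
fused = multi_head_cross_attention(text, regions)
print(fused.shape)  # (6, 64)
```

Each text token thus receives a region-weighted summary of the visual scene, which is the fusion signal a downstream decoder can use to ground the command in a specific region.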
This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).