Research | Open Access

Large multimodal agents: a survey

Xie Junlin¹, Zhihong Chen¹, Ruifei Zhang¹, Guanbin Li²

¹ Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Guangdong, 518172, China

² School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510275, China

An erratum to this article is available online at:

https://doi.org/10.1007/s44267-025-00104-y

Abstract

Large language models (LLMs) have achieved superior performance in powering text-based AI agents, endowing them with decision-making and reasoning abilities analogous to those of humans. Concurrently, an emerging research trend focuses on extending these LLM-powered AI agents into the multimodal domain. This extension enables AI agents to interpret and respond to diverse multimodal user queries, and thereby to handle more intricate and nuanced tasks. In this paper, we conduct a systematic review of LLM-driven multimodal agents, which we refer to as large multimodal agents (LMAs). First, we introduce the essential components involved in developing LMAs and categorize the current body of research into four distinct types. Subsequently, we review the collaborative frameworks that integrate multiple LMAs with the aim of enhancing collective efficacy. One of the critical challenges in this field is the diversity of evaluation methods used across existing studies, which impedes effective comparison among different LMAs. Therefore, we compile these evaluation methodologies and establish a comprehensive framework to bridge the gaps. This framework aims to standardize evaluations and facilitate more meaningful comparisons. To conclude our review, we highlight the extensive applications of LMAs and propose potential future research directions. Our discussion aims to provide valuable insights and guidelines for future research in this rapidly evolving field.

Keywords

Large multimodal agents; Comprehensive framework; AI agents

Visual Intelligence

Volume 3, 2025

Article number: 24

DOI: 10.1007/s44267-025-00093-y

Cite this article:

Junlin X, Chen Z, Zhang R, et al. Large multimodal agents: a survey. Visual Intelligence, 2025, 3: 24. https://doi.org/10.1007/s44267-025-00093-y

Received: 27 April 2025

Revised: 12 October 2025

Accepted: 16 October 2025

Published: 03 December 2025

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.