Regular Paper

KnowBench: Evaluating the Knowledge Alignment on Large Visual Language Models

National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China
School of Artificial Intelligence, Nanjing University, Nanjing 210023, China

Equal Contribution (Zheng Ma proposed the overall research concept, provided the original data, and drafted the manuscript; Hao-Tian Yang was in charge of data generation and model testing, and created the figures in the paper.)


Abstract

Large visual language models (LVLMs) have revolutionized the multimodal domain, demonstrating exceptional performance on tasks that require fusing visual and textual information. However, current evaluation benchmarks fail to adequately assess the knowledge alignment between images and text, focusing primarily on answer accuracy rather than on the reasoning processes behind the answers. To address this gap and deepen the understanding of LVLMs’ capabilities, we introduce KnowBench, a novel benchmark designed to assess the alignment of knowledge between images and text for LVLMs. KnowBench comprises 1 081 image-question pairs, each with four options and four pieces of corresponding knowledge, spanning 11 major categories. We evaluate mainstream LVLMs on KnowBench, including proprietary models such as Gemini, Claude, and GPT, and open-source models such as LLaVA, Qwen-VL, and InternVL. Our experiments reveal a notable discrepancy between the models’ abilities to select correct answers and to select the corresponding knowledge, regardless of whether the models are open-source or proprietary, indicating that a significant gap remains in current LVLMs’ knowledge alignment between images and text. Further analysis shows that model performance on KnowBench improves with increased parameters and version iterations, suggesting that scaling laws have a significant impact on multimodal knowledge alignment and that researchers’ iteration of models also has a positive effect. We anticipate that KnowBench will foster the development of LVLMs and motivate researchers to develop more reliable models. Our dataset is publicly available at https://doi.org/10.57760/sciencedb.29672.
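To make the evaluation setup concrete, the following is a minimal sketch of what a KnowBench-style item and its scoring could look like. All field names, the `KnowBenchItem` class, and the `alignment_scores` helper are illustrative assumptions, not the benchmark's actual schema; the point is the distinction between answer accuracy, knowledge accuracy, and their joint (aligned) accuracy.

```python
from dataclasses import dataclass

# Hypothetical record layout: one image-question pair with four answer
# options and four candidate knowledge statements, as in the abstract.
@dataclass
class KnowBenchItem:
    image_path: str
    question: str
    options: list[str]      # four answer options
    knowledge: list[str]    # four candidate knowledge statements
    answer_idx: int         # index of the correct option
    knowledge_idx: int      # index of the knowledge supporting that answer

def alignment_scores(items, predictions):
    """predictions: list of (answer_idx, knowledge_idx) pairs from a model.

    Returns answer accuracy, knowledge accuracy, and joint accuracy,
    where 'joint' means the model picks both the right answer and the
    knowledge that justifies it.
    """
    ans = sum(p[0] == it.answer_idx for it, p in zip(items, predictions))
    knw = sum(p[1] == it.knowledge_idx for it, p in zip(items, predictions))
    both = sum(p == (it.answer_idx, it.knowledge_idx)
               for it, p in zip(items, predictions))
    n = len(items)
    return ans / n, knw / n, both / n

# Tiny usage example with dummy data.
items = [
    KnowBenchItem("img0.jpg", "q0", ["a", "b", "c", "d"],
                  ["k0", "k1", "k2", "k3"], 1, 2),
    KnowBenchItem("img1.jpg", "q1", ["a", "b", "c", "d"],
                  ["k0", "k1", "k2", "k3"], 0, 0),
]
preds = [(1, 2), (0, 3)]  # second prediction: right answer, wrong knowledge
print(alignment_scores(items, preds))  # (1.0, 0.5, 0.5)
```

A gap between the first number (answer accuracy) and the third (joint accuracy) is exactly the kind of misalignment the abstract reports for current LVLMs.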

Electronic Supplementary Material

Download File(s)
JCST-2504-15512-Highlights.pdf (856.7 KB)

Journal of Computer Science and Technology, Pages 1209-1219

Cite this article:
Ma Z, Yang H-T, Zhang J-B, et al. KnowBench: Evaluating the Knowledge Alignment on Large Visual Language Models. Journal of Computer Science and Technology, 2025, 40(5): 1209-1219. https://doi.org/10.1007/s11390-025-5512-y

Views: 496 · Crossref: 0 · Web of Science: 0 · Scopus: 0 · CSCD: 0

Received: 30 April 2025
Accepted: 08 September 2025
Published: 10 September 2025
© Institute of Computing Technology, Chinese Academy of Sciences 2025