Open Access

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
Chinese University of Hong Kong, Shenzhen 518172, China
Shanghai Artificial Intelligence Research Institute, Shanghai 200240, and also with Shanghai Jiao Tong University, Shanghai 200240, China
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai 200240, China
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
West China Hospital, Sichuan University, Chengdu 610041, China
School of Design and Innovation, Tongji University, Shanghai 200092, China
Department of Computer Science and Technology, East China University of Science and Technology, Shanghai 200237, China
School of Computer Science, Fudan University, Shanghai 200433, China
Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai 200092, China

Mianxin Liu and Weiguo Hu contributed equally to this paper.


Abstract

Ensuring that medical Large Language Models (LLMs) are generally effective and beneficial to human beings before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLMs, especially in the Chinese context, has yet to be established. In this work, we introduce “MedBench”, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the currently largest evaluation dataset (300,901 questions), covering 43 clinical specialties, and performs multi-faceted evaluation of medical LLMs. Second, MedBench provides a standardized, fully automatic cloud-based evaluation infrastructure with physical separation between questions and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general-purpose and medical LLMs, we observe unbiased, reproducible evaluation results that largely align with medical professionals’ perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.

Electronic Supplementary Material

Download File(s)
BDMA-2024-0079_ESM.pdf (245.8 KB)

Big Data Mining and Analytics
Pages 1116-1128

Cite this article:
Liu M, Hu W, Ding J, et al. MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models. Big Data Mining and Analytics, 2024, 7(4): 1116-1128. https://doi.org/10.26599/BDMA.2024.9020044

8225 Views · 1522 Downloads
Citations: 20 Crossref · 12 Web of Science · 16 Scopus · 0 CSCD

Received: 06 February 2024
Revised: 09 May 2024
Accepted: 11 June 2024
Published: 04 December 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).