Open Access

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
Chinese University of Hong Kong, Shenzhen 518172, China
Shanghai Artificial Intelligence Research Institute, Shanghai 200240, and also with Shanghai Jiao Tong University, Shanghai 200240, China
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai 200240, China
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
West China Hospital, Sichuan University, Chengdu 610041, China
School of Design and Innovation, Tongji University, Shanghai 200092, China
Department of Computer Science and Technology, East China University of Science and Technology, Shanghai 200237, China
School of Computer Science, Fudan University, Shanghai 200433, China
Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai 200092, China

Mianxin Liu and Weiguo Hu contributed equally to this paper.


Abstract

Ensuring that medical Large Language Models (LLMs) are generally effective and beneficial to human beings before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLMs, especially in the Chinese context, has yet to be established. In this work, we introduce “MedBench”, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the currently largest evaluation dataset (300,901 questions), covering 43 clinical specialties, and performs multi-faceted evaluation of medical LLMs. Second, MedBench provides a standardized, fully automatic cloud-based evaluation infrastructure with physical separation between questions and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general-purpose and medical LLMs, we observe unbiased, reproducible evaluation results that largely align with medical professionals’ perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.

Electronic Supplementary Material

Download File(s)
BDMA-2024-0079_ESM.pdf (245.8 KB)

Big Data Mining and Analytics
Pages 1116-1128

Cite this article:
Liu M, Hu W, Ding J, et al. MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models. Big Data Mining and Analytics, 2024, 7(4): 1116-1128. https://doi.org/10.26599/BDMA.2024.9020044

8225 Views · 1522 Downloads
Citations: 20 Crossref · 12 Web of Science · 16 Scopus · 0 CSCD

Received: 06 February 2024
Revised: 09 May 2024
Accepted: 11 June 2024
Published: 04 December 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).