Open Access Research Article
Video-Bench: A comprehensive benchmark and toolkit for evaluating video-based large language models
Computational Visual Media 2026, 12(1): 71-84
Published: 02 February 2026

Video-based large language models (Video-LLMs) have recently been introduced, targeting both fundamental improvements in perception and comprehension and a diverse range of user inquiries. In pursuit of the ultimate goal of artificial general intelligence, a truly intelligent Video-LLM should not only see and understand its surroundings but also possess human-level commonsense and make well-informed decisions for users. To guide the development of such a model, establishing a robust and comprehensive evaluation system is crucial. To this end, this paper proposes Video-Bench, a new comprehensive benchmark along with a toolkit specifically designed for evaluating Video-LLMs. The benchmark comprises 10 meticulously crafted tasks, evaluating the capabilities of Video-LLMs across three distinct levels: video-exclusive understanding, prior-knowledge-based question answering, and comprehension and decision-making. In addition, we introduce an automatic toolkit tailored to process model outputs for the various tasks, facilitating the calculation of metrics and conveniently generating final scores. We evaluate 9 representative Video-LLMs using Video-Bench. The findings reveal that current Video-LLMs still fall considerably short of human-like comprehension and analysis of real-world videos, and they offer valuable insights for future research directions. The benchmark and toolkit are available at https://github.com/PKU-YuanGroup/Video-Bench.
