Open Access

Dynamic Batch Processing with FlexiDecode Scheduler for Efficient LLM Inference in IIoT

Key Laboratory of Computing Power Network and Information Security of Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250000, China, and also with Shandong Provincial Key Laboratory of Computer Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Jinan 250000, China
Department of Computing Technologies, Swinburne University of Technology, Melbourne 3000, Australia
Show Author Information

Abstract

Large Language Models (LLMs) are expanding their applications across various fields, including the Industrial Internet of Things (IIoT), where they analyze sensor data, automate diagnostics, and enhance predictive maintenance. LLM inference is provided as a service to users, with each inference request undergoing two phases: prefill and decode. Due to the autoregressive nature of generation, only one token can be produced per request per iteration, so multiple iterations are needed to complete a request. Typically, batch processing groups multiple requests into a single batch for inference, improving throughput and hardware utilization. However, in serving systems, a fixed batch size presents challenges under fluctuating request volumes, particularly in IIoT environments, where data flow can vary significantly: during high-load periods, a fixed batch size may lead to underutilization of resources, while during low-load periods, it may result in resource wastage. In this paper, we introduce the FlexiDecode Scheduler (FDS), which addresses these challenges by dynamically adjusting the decoding batch size according to system load, improving resource utilization and reducing wait time during high-load periods. FDS prioritizes prefilling new requests to maximize decoding efficiency and employs a request output length predictor to optimize request scheduling, minimizing End-to-End (E2E) latency. Compared to vLLM and Sarathi, our approach reduces E2E latency by 23% and 16%, improves actual request execution time by 34% and 15%, respectively, and increases computational utilization by 10%.
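The scheduling ideas in the abstract — a decode batch size that tracks system load, plus output-length-aware request ordering — can be illustrated with a minimal sketch. This is not the paper's implementation; all names (`FlexiDecodeSketch`, `Request`, `predicted_len`) are hypothetical, and the load rule is a simple clamp of batch size to queue depth, assumed for illustration.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Request:
    # Hypothetical fields: predicted_len stands in for the paper's
    # output length predictor; rid is just a request identifier.
    predicted_len: int
    rid: int = field(compare=False)
    prefilled: bool = field(default=False, compare=False)

class FlexiDecodeSketch:
    """Toy scheduler: grows or shrinks the decode batch with load and
    serves requests with the shortest predicted output first."""

    def __init__(self, min_batch: int = 1, max_batch: int = 8):
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.waiting: list[Request] = []  # min-heap keyed on predicted_len

    def submit(self, req: Request) -> None:
        heapq.heappush(self.waiting, req)

    def next_batch(self) -> list[Request]:
        # Scale the batch size with queue depth, clamped to [min, max];
        # under high load the batch grows, under low load it shrinks.
        size = max(self.min_batch, min(self.max_batch, len(self.waiting)))
        batch = [heapq.heappop(self.waiting)
                 for _ in range(min(size, len(self.waiting)))]
        for r in batch:
            r.prefilled = True  # prefill runs before decode iterations begin
        return batch
```

For example, with `max_batch=2` and three queued requests whose predicted lengths are 5, 2, and 9, the next batch holds two requests ordered shortest-first, approximating the E2E-latency-minimizing ordering described above.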

Big Data Mining and Analytics
Pages 1307-1323

Cite this article:
Jia X, Gu B, Chen J, et al. Dynamic Batch Processing with FlexiDecode Scheduler for Efficient LLM Inference in IIoT. Big Data Mining and Analytics, 2025, 8(6): 1307-1323. https://doi.org/10.26599/BDMA.2025.9020025

Received: 23 November 2024
Revised: 25 December 2024
Accepted: 24 February 2025
Published: 19 September 2025
© The author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).