Open Access

Dynamic Batch Processing with FlexiDecode Scheduler for Efficient LLM Inference in IIoT

Key Laboratory of Computing Power Network and Information Security of Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250000, China, and also with Shandong Provincial Key Laboratory of Computer Power Internet and Service Computing, Shandong Fundamental Research Center for Computer Science, Jinan 250000, China
Department of Computing Technologies, Swinburne University of Technology, Melbourne 3000, Australia
Show Author Information

Abstract

Large Language Models (LLMs) are expanding their applications across various fields, including the Industrial Internet of Things (IIoT), where they analyze sensor data, automate diagnostics, and enhance predictive maintenance. LLM inference is provided as a service to users, with each inference request undergoing two phases: prefill and decode. Due to the autoregressive nature of generation, only one token can be produced per request per iteration, so multiple iterations are needed to complete a request. Typically, batch processing groups multiple requests into a single batch for inference, improving throughput and hardware utilization. However, in serving systems, a fixed batch size presents challenges under fluctuating request volumes, particularly in IIoT environments, where data flow can vary significantly: during high-load periods, a fixed batch size may lead to underutilization of resources, while during low-load periods, it may result in resource wastage. In this paper, we introduce the FlexiDecode Scheduler (FDS), which addresses these challenges by dynamically adjusting the decoding batch size according to system load, improving resource utilization and reducing wait time during high-load periods. FDS prioritizes prefilling new requests to maximize decoding efficiency and employs a request output length predictor to optimize request scheduling, minimizing End-to-End (E2E) latency. Compared to vLLM and Sarathi, our approach reduces E2E latency by 23% and 16%, improves actual request execution time by 34% and 15%, respectively, and increases computational utilization by 10%.
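The scheduling ideas in the abstract — a decode batch size that tracks system load, plus output-length-aware request ordering — can be illustrated with a minimal sketch. This is not the paper's implementation; all names (`FlexiDecodeSketch`, `Request`, `predicted_len`) are hypothetical, and the load rule is a simple clamp of batch size to queue depth, assumed for illustration.

```python
from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Request:
    # Hypothetical fields: predicted_len stands in for the paper's
    # output length predictor; rid is just a request identifier.
    predicted_len: int
    rid: int = field(compare=False)
    prefilled: bool = field(default=False, compare=False)

class FlexiDecodeSketch:
    """Toy scheduler: grows or shrinks the decode batch with load and
    serves requests with the shortest predicted output first."""

    def __init__(self, min_batch: int = 1, max_batch: int = 8):
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.waiting: list[Request] = []  # min-heap keyed on predicted_len

    def submit(self, req: Request) -> None:
        heapq.heappush(self.waiting, req)

    def next_batch(self) -> list[Request]:
        # Scale the batch size with queue depth, clamped to [min, max];
        # under high load the batch grows, under low load it shrinks.
        size = max(self.min_batch, min(self.max_batch, len(self.waiting)))
        batch = [heapq.heappop(self.waiting)
                 for _ in range(min(size, len(self.waiting)))]
        for r in batch:
            r.prefilled = True  # prefill runs before decode iterations begin
        return batch
```

For example, with `max_batch=2` and three queued requests whose predicted lengths are 5, 2, and 9, the next batch holds two requests ordered shortest-first, approximating the E2E-latency-minimizing ordering described above.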

Big Data Mining and Analytics
Pages 1307-1323

Cite this article:
Jia X, Gu B, Chen J, et al. Dynamic Batch Processing with FlexiDecode Scheduler for Efficient LLM Inference in IIoT. Big Data Mining and Analytics, 2025, 8(6): 1307-1323. https://doi.org/10.26599/BDMA.2025.9020025

Received: 23 November 2024
Revised: 25 December 2024
Accepted: 24 February 2025
Published: 19 September 2025
© The author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).