Open Access

ResDecode: Accelerating Large Language Models Inference via Residual Decoding Heads

Shien Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou 511442, China
School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
Department of Computer Science and Engineering, School of Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China
Huawei Technologies Co. Ltd., Hangzhou 310000, China
School of Future Technology, South China University of Technology, Guangzhou 511442, China

Ziqian Zeng and Jiahong Yu contributed equally to this work.


Abstract

Large Language Models (LLMs) have immense potential to enhance the capabilities of Cyber-Physical-Social Intelligence (CPSI) systems, enabling them to better engage with complex cyber, physical, and social environments. However, the high inference latency of LLMs, inherent to the autoregressive decoding process, hinders their wide application in CPSI systems. To address this challenge, current approaches incorporate speculative decoding to predict multiple subsequent tokens in parallel, thereby accelerating inference. Nevertheless, the accuracy of these decoding heads falls short of that of autoregressive decoding. In light of these limitations, we propose ResDecode, a novel speculative decoding method characterized by efficient and accurate decoding heads. Within the lightweight draft model, we propose a residual decoding head that compensates for the full context encoder's limited capability to capture long-range dependencies, thus improving accuracy. ResDecode demonstrates impressive results, achieving a maximum speedup ratio of 3.2× on MT-Bench compared to vanilla autoregressive decoding.
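To make the draft-then-verify mechanism concrete, below is a minimal toy sketch of speculative decoding, the general scheme that ResDecode's decoding heads plug into. This is not the paper's ResDecode implementation: `target_next`, `draft_next`, the toy vocabulary, and the acceptance rule shown here are simplified assumptions for illustration (greedy decoding, a deterministic toy "model").

```python
import random

random.seed(0)

def target_next(context):
    # Toy stand-in for the full LLM's greedy next-token choice.
    return sum(context) % 10

def draft_next(context):
    # Toy draft head: usually agrees with the target model, sometimes errs.
    guess = target_next(context)
    return guess if random.random() < 0.8 else (guess + 1) % 10

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them against the target model.

    In practice the target model verifies all k drafts in a single parallel
    forward pass, which is where the speedup over token-by-token decoding
    comes from; here verification is written as a simple loop.
    """
    drafts = []
    ctx = list(context)
    for _ in range(k):
        t = draft_next(ctx)
        drafts.append(t)
        ctx.append(t)

    accepted = []
    ctx = list(context)
    for t in drafts:
        if t == target_next(ctx):      # draft matches the target: accept it
            accepted.append(t)
            ctx.append(t)
        else:                           # first mismatch: take the target's token
            accepted.append(target_next(ctx))
            break
    else:                               # all drafts accepted: bonus target token
        accepted.append(target_next(ctx))
    return accepted

def generate(prompt, n_tokens, k=4):
    # Verification guarantees the output matches greedy target decoding,
    # but each step may emit several tokens instead of one.
    out = list(prompt)
    while len(out) < len(prompt) + n_tokens:
        out.extend(speculative_step(out, k))
    return out[:len(prompt) + n_tokens]
```

Because every accepted token is checked against (or produced by) the target model, the output is identical to plain autoregressive greedy decoding; the draft head only changes how many tokens each expensive target-model pass yields.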

Big Data Mining and Analytics
Pages 779-793


Cite this article:
Zeng Z, Yu J, Pang Q, et al. ResDecode: Accelerating Large Language Models Inference via Residual Decoding Heads. Big Data Mining and Analytics, 2025, 8(4): 779-793. https://doi.org/10.26599/BDMA.2024.9020074

1856 Views · 134 Downloads · 3 Crossref · 3 Web of Science · 4 Scopus · 0 CSCD

Received: 20 April 2024
Revised: 20 September 2024
Accepted: 12 October 2024
Published: 12 May 2025
© The author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).