Open Access

ResDecode: Accelerating Large Language Models Inference via Residual Decoding Heads

Shien Ming Wu School of Intelligent Engineering, South China University of Technology, Guangzhou 511442, China
School of Computer Science and Engineering, South China University of Technology, Guangzhou 510006, China
Department of Computer Science and Engineering, School of Engineering, The Hong Kong University of Science and Technology, Hong Kong 999077, China
Huawei Technologies Co. Ltd., Hangzhou 310000, China
School of Future Technology, South China University of Technology, Guangzhou 511442, China

Ziqian Zeng and Jiahong Yu contributed equally to this work.


Abstract

Large Language Models (LLMs) have immense potential to enhance the capabilities of Cyber-Physical-Social Intelligence (CPSI) systems, enabling them to better engage with complex cyber, physical, and social environments. However, the high inference latency of LLMs, inherent to the autoregressive decoding process, hinders their wide application in CPSI systems. To address this challenge, current approaches incorporate speculative decoding to predict multiple subsequent tokens in parallel, thereby accelerating inference. Nevertheless, the accuracy of these decoding heads falls short of that of autoregressive decoding. In light of these limitations, we propose ResDecode, a novel speculative decoding method characterized by efficient and accurate decoding heads. Within the lightweight draft model, we propose a residual decoding head that compensates for the full context encoder's limited capability to capture long-range dependencies, thus improving accuracy. ResDecode demonstrates impressive results, achieving a maximum speedup ratio of 3.2× on MT-Bench compared to vanilla autoregressive decoding.
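To make the draft-then-verify mechanism concrete, below is a minimal toy sketch of speculative decoding, the general scheme that ResDecode's decoding heads plug into. This is not the paper's ResDecode implementation: `target_next`, `draft_next`, the toy vocabulary, and the acceptance rule shown here are simplified assumptions for illustration (greedy decoding, a deterministic toy "model").

```python
import random

random.seed(0)

def target_next(context):
    # Toy stand-in for the full LLM's greedy next-token choice.
    return sum(context) % 10

def draft_next(context):
    # Toy draft head: usually agrees with the target model, sometimes errs.
    guess = target_next(context)
    return guess if random.random() < 0.8 else (guess + 1) % 10

def speculative_step(context, k=4):
    """Draft k tokens cheaply, then verify them against the target model.

    In practice the target model verifies all k drafts in a single parallel
    forward pass, which is where the speedup over token-by-token decoding
    comes from; here verification is written as a simple loop.
    """
    drafts = []
    ctx = list(context)
    for _ in range(k):
        t = draft_next(ctx)
        drafts.append(t)
        ctx.append(t)

    accepted = []
    ctx = list(context)
    for t in drafts:
        if t == target_next(ctx):      # draft matches the target: accept it
            accepted.append(t)
            ctx.append(t)
        else:                           # first mismatch: take the target's token
            accepted.append(target_next(ctx))
            break
    else:                               # all drafts accepted: bonus target token
        accepted.append(target_next(ctx))
    return accepted

def generate(prompt, n_tokens, k=4):
    # Verification guarantees the output matches greedy target decoding,
    # but each step may emit several tokens instead of one.
    out = list(prompt)
    while len(out) < len(prompt) + n_tokens:
        out.extend(speculative_step(out, k))
    return out[:len(prompt) + n_tokens]
```

Because every accepted token is checked against (or produced by) the target model, the output is identical to plain autoregressive greedy decoding; the draft head only changes how many tokens each expensive target-model pass yields.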

Big Data Mining and Analytics
Pages 779-793


Cite this article:
Zeng Z, Yu J, Pang Q, et al. ResDecode: Accelerating Large Language Models Inference via Residual Decoding Heads. Big Data Mining and Analytics, 2025, 8(4): 779-793. https://doi.org/10.26599/BDMA.2024.9020074

1856 Views · 134 Downloads · 3 Crossref · 3 Web of Science · 4 Scopus · 0 CSCD

Received: 20 April 2024
Revised: 20 September 2024
Accepted: 12 October 2024
Published: 12 May 2025
© The author(s) 2025.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).