Leveraging Large-Scale Data for Efficient Low-Bit CUTLASS GEMM Optimization via Neural Networks

Hong Guo; Nianhui Guo; Christoph Meinel; Haojin Yang

doi:10.26599/BDMA.2025.9020065

Open Access

Hasso Plattner Institute for Digital Engineering gGmbH, University of Potsdam, Potsdam 14482, Germany

Abstract

Optimizing GEneral Matrix Multiplication (GEMM) on GPU platforms is becoming increasingly critical to meet the growing computational demands of modern deep neural network research. While significant progress has been made in accelerating high-precision GEMM, the optimization of low-bit GEMM remains a challenging open problem. The CUTLASS library provides highly optimized low-bit GEMM templates leveraging Tensor Cores; however, performance varies considerably depending on tile and pipeline configurations across different GPU architectures. In this work, we propose a novel auto-tuning framework for low-bit CUTLASS GEMM, utilizing a neural network model to predict optimal GEMM template parameters for target GPUs. Our model is trained on a synthetic dataset with up to 116100 unique samples, encompassing diverse matrix sizes across various Ampere GPUs, and is thoroughly evaluated on these hardware platforms. Experimental results show that our method achieves an accuracy of up to 95.11% on the validation dataset. Furthermore, real-time evaluations of low-bit data types on the A100 GPU demonstrate speedups of up to 1.99× for GEMM operations and 1.28× for the linear layer, compared to the default CUTLASS templates.
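The selection step described in the abstract, mapping a matrix shape to a tile and pipeline configuration, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TileConfig` fields, the candidate list, the sample labels, and the log2 feature encoding are hypothetical, and a 1-nearest-neighbour lookup stands in for the paper's neural network, whose architecture and inputs are not given here.

```python
from dataclasses import dataclass
from math import log2


@dataclass(frozen=True)
class TileConfig:
    # Hypothetical template parameters: threadblock tile shape and pipeline stages.
    tile_m: int
    tile_n: int
    tile_k: int
    stages: int


# Hypothetical candidate set; not taken from the paper or from CUTLASS defaults.
CANDIDATES = [
    TileConfig(128, 256, 64, 3),
    TileConfig(128, 128, 64, 4),
    TileConfig(64, 64, 128, 5),
]

# Made-up "profiling" samples: (M, N, K) -> index of the fastest candidate.
SAMPLES = [
    ((4096, 4096, 4096), 0),
    ((1024, 1024, 1024), 1),
    ((128, 256, 512), 2),
]


def features(m: int, n: int, k: int) -> tuple:
    # Log-scale the dimensions so that large and small shapes are comparable.
    return (log2(m), log2(n), log2(k))


def predict_config(m: int, n: int, k: int) -> TileConfig:
    """Return the candidate labelled for the nearest profiled shape.

    A 1-nearest-neighbour stand-in for a learned predictor.
    """
    query = features(m, n, k)

    def squared_distance(sample) -> float:
        shape, _ = sample
        return sum((a - b) ** 2 for a, b in zip(query, features(*shape)))

    _, label = min(SAMPLES, key=squared_distance)
    return CANDIDATES[label]


if __name__ == "__main__":
    print(predict_config(256, 256, 512))
```

In a real auto-tuner the samples would come from profiling each candidate template on the target GPU, and the predicted configuration would be used to instantiate the GEMM kernel in place of the default template.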

Keywords

Low-bit GEneral Matrix Multiplication (GEMM); CUTLASS optimization; neural network auto-tuning; Tensor Cores; tile and pipeline; large-scale dataset

Big Data Mining and Analytics

Volume 9, Issue 2, April 2026

Pages 632-652

DOI: 10.26599/BDMA.2025.9020065

Cite this article:

Guo H, Guo N, Meinel C, et al. Leveraging Large-Scale Data for Efficient Low-Bit CUTLASS GEMM Optimization via Neural Networks. Big Data Mining and Analytics, 2026, 9(2): 632-652. https://doi.org/10.26599/BDMA.2025.9020065

Received: 24 January 2025

Revised: 05 May 2025

Accepted: 23 May 2025

Published: 09 February 2026

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).