Scholar - SciOpen

Open Access Article Issue

Optimizing Fine-Tuning in Quantized Language Models: An In-Depth Analysis of Key Variables

Ao Shen, Zhiquan Lai, Dongsheng Li, Xiaoyu Hu

Computers, Materials & Continua 2025, 82(1): 307-325

Published: 31 January 2025

Abstract

Large-scale Language Models (LLMs) have achieved significant breakthroughs in Natural Language Processing (NLP), driven by the pre-training and fine-tuning paradigm. While this approach allows models to specialize in specific tasks with reduced training costs, the substantial memory requirements during fine-tuning present a barrier to broader deployment. Parameter-Efficient Fine-Tuning (PEFT) techniques, such as Low-Rank Adaptation (LoRA), and parameter quantization methods have emerged as solutions to address these challenges by optimizing memory usage and computational efficiency. Among these, QLoRA, which combines PEFT and quantization, has demonstrated notable success in reducing memory footprints during fine-tuning, prompting the development of various QLoRA variants. Despite these advancements, the quantitative impact of key variables on the fine-tuning performance of quantized LLMs remains underexplored. This study presents a comprehensive analysis of these key variables, focusing on their influence across different layer types and depths within LLM architectures. Our investigation uncovers several critical findings: (1) Larger layers, such as MLP layers, can maintain performance despite reductions in adapter rank, while smaller layers, like self-attention layers, are more sensitive to such changes; (2) The effectiveness of balancing factors depends more on their specific values than on layer type or depth; (3) In quantization-aware fine-tuning, larger layers can effectively utilize smaller adapters, whereas smaller layers struggle to do so. These insights suggest that layer type is a more significant determinant of fine-tuning success than layer depth when optimizing quantized LLMs. Moreover, for the same reduction in trainable parameters, reducing the trainable parameters in a larger layer is more effective in preserving fine-tuning accuracy than reducing them in a smaller one. This study provides valuable guidance for more efficient fine-tuning strategies and opens avenues for further research into optimizing LLM fine-tuning in resource-constrained environments.
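
To make the relationship between adapter rank and layer size concrete, the sketch below counts the trainable parameters of a single LoRA adapter, using the standard LoRA factorization in which a d_out x d_in weight receives two low-rank factors of shapes rank x d_in and d_out x rank. The layer dimensions (4096 and 11008) are assumed, LLaMA-7B-like values chosen only for illustration; they are not taken from the paper, and the code says nothing about accuracy.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


# Assumed, illustrative layer shapes (not from the paper).
HIDDEN = 4096
MLP_INNER = 11008

attn_proj = (HIDDEN, HIDDEN)    # a square self-attention projection
mlp_proj = (HIDDEN, MLP_INNER)  # a wider MLP projection

for rank in (64, 16, 4):
    a = lora_params(*attn_proj, rank)
    m = lora_params(*mlp_proj, rank)
    print(f"rank={rank:>2}  attention adapter={a:>9,}  MLP adapter={m:>9,}")
```

Under these assumed shapes, lowering the rank by the same amount removes more trainable parameters from the MLP adapter than from the attention adapter, simply because the adapter size scales with d_in + d_out.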

Open Access Issue

Technologies for memory optimization for large model training on domestic platforms

Dongsheng LI, Yu TANG, Linbo QIAO, Qianru LYU

Journal of National University of Defense Technology 2026, 48(2): 284-295

Published: 01 April 2026

Abstract

Significance

The rapid evolution of LLMs (large language models) has led to an exponential increase in parameter counts, creating a severe contradiction with the relatively slow growth of GPU memory capacity—a phenomenon often referred to as the “Memory Wall.” For domestic computing platforms in China, this challenge is particularly acute. The significance of this research lies in its systematic focus on bridging the gap between high-compute requirements and limited memory resources within the context of indigenous hardware architectures such as Ascend, Cambricon, and MT-3000. Unlike global research that primarily targets general-purpose NVIDIA platforms, this work addresses the unique structural bottlenecks of domestic chips, including restricted PCIe bandwidth, customized instruction sets, and less mature software ecosystems. By analyzing recomputation and computation offloading strategies through the lens of hardware-software co-design, the paper provides a theoretical and practical framework for achieving “technological self-reliance” in AI training. The research is vital for the domestic industry as it enables the training of trillion-parameter models on local hardware, ensuring that Chinese AI development remains competitive despite international hardware constraints. It transforms memory optimization from a simple “space-saving” exercise into a strategic balancing act between computational efficiency, hardware adaptation, and system-level throughput, which is essential for the large-scale industrial deployment of domestic AI solutions.

Progress

Current research has progressed from single-technique breakthroughs to complex, multi-technology fusion strategies. The paper highlights significant milestones in mathematical programming for memory management, such as XEngine’s use of MIQP (mixed integer quadratic programming) and OLLA’s focus on minimizing tensor residence to reduce fragmentation. A major shift is observed in the transition from “offloading-heavy” approaches to more nuanced hybrid strategies. For instance, the DELTA scheme proposed by the NUDT (National University of Defense Technology) in 2024 optimizes dynamic control flow scenarios, breaking the traditional reliance on simple data swapping. Researchers have moved toward creating middleware that abstracts the core attributes of instruction sets (like LoongISA, SVE2, and BANG) into standardized parameters, allowing recomputation operators to adapt to different chips without redundant redesigns. Furthermore, the integration of hardware-native features—such as the Da Vinci architecture’s matrix units in Huawei Ascend—with dynamic compilation techniques has significantly improved memory access efficiency for long-sequence tasks. These advancements represent a move toward “platform-customized” optimization, where algorithms are no longer generic but are deeply coupled with the underlying hardware’s memory hierarchy and interconnect bandwidth.

Conclusions and Prospects

The paper concludes that while domestic platforms face three primary hurdles—architectural heterogeneity, fragmented software ecosystems, and complex instruction modeling—these also present unique opportunities for “backward-compatible” innovation through software-defined hardware optimization. The systematic validation on platforms like MT-3000 proves that a “technology-platform-efficiency” argument chain is viable, providing a roadmap for chip manufacturers and framework developers. The researchers emphasize that the future of memory optimization lies in full-stack integration, where the boundaries between hardware scheduling, compiler optimization, and algorithmic sparsity are blurred. Moving forward, three key prospects are identified. First, the development of hardware-software synergistic memory management will focus on decoupling memory allocation from model compilation to handle massive sequence lengths. Second, the rise of MoE (mixture-of-experts) models necessitates system-level innovations in dynamic sparsity and load balancing to mitigate bandwidth limitations. Third, the growth of open-source ecosystems, exemplified by communities like DeepSeek, will be the primary catalyst for breaking the “software moat” of international competitors. The ultimate goal is to build a full-stack toolchain—covering development, deployment, and monitoring—that lowers migration costs for developers. By successfully “breaking the memory wall,” domestic platforms can transition from being functional alternatives to becoming high-performance leaders in the era of trillion-parameter model training.

Open Access Issue

Optimizing Data Distributions Based on Jensen-Shannon Divergence for Federated Learning

Zhiyao Hu, Dongsheng Li, Ke Yang, Ying Xu, Baoyun Peng

Tsinghua Science and Technology 2025, 30(2): 670-681

Published: 09 December 2024

Abstract

In current federated learning frameworks, a central server randomly selects a small number of clients to train local models at the beginning of each global iteration. Since clients’ local data are not independent and identically distributed (non-IID), some local models are inconsistent with the global model. Existing studies employ model cleaning methods to find inconsistent local models. These methods measure the cosine similarity between local models and the global model; an inconsistent local model is cleaned out and is not aggregated into the next global model. However, model cleaning methods incur negative effects such as large computation overheads and limited updates. In this paper, we propose a data distribution optimization method, called federated distribution optimization (FedDO), aiming to overcome the shortcomings of model cleaning methods. FedDO calculates the gradient of the Jensen-Shannon divergence to decrease the discrepancy between selected clients’ data distribution and the overall data distribution. We test our method on a multi-class regression model, a multi-layer perceptron, and a convolutional neural network on a handwritten digit image dataset. Compared with model cleaning methods, FedDO improves the training accuracy by 1.8%, 2.6%, and 5.6%, respectively.
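
As a rough illustration of the quantity FedDO works with, the sketch below computes the Jensen-Shannon divergence between a client's label distribution and an overall label distribution. It shows only the divergence itself, not FedDO's gradient-based optimization; the label counts and helper names are hypothetical.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; zero-probability terms of p contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: the average KL divergence of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def normalize(counts):
    """Turn raw label counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


# Hypothetical label counts over four classes.
overall = normalize([100, 100, 100, 100])
skewed_client = normalize([90, 5, 3, 2])
balanced_client = normalize([26, 24, 25, 25])

print("skewed client:  ", js_divergence(skewed_client, overall))
print("balanced client:", js_divergence(balanced_client, overall))
```

The divergence is symmetric, equals zero for identical distributions, and is bounded above by ln 2 when natural logarithms are used, which makes it a convenient score for how far a selection of clients drifts from the overall data.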

Survey Issue

Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview

Lei Guan, Dong-Sheng Li, Ji-Ye Liang, Wen-Jian Wang, Ke-Shi Ge, Xi-Cheng Lu

Journal of Computer Science and Technology 2024, 39(3): 567-584

Published: 22 July 2024

Abstract

Deep learning has become the cornerstone of artificial intelligence, playing an increasingly important role in human production and lifestyle. However, as the complexity of problem-solving increases, deep learning models become increasingly intricate, resulting in a proliferation of large language models with an astonishing number of parameters. Pipeline model parallelism (PMP) has emerged as one of the mainstream approaches to addressing the significant challenge of training “big models”. This paper presents a comprehensive review of PMP. It covers the basic concepts and main challenges of PMP. It also comprehensively compares synchronous and asynchronous pipeline schedules for PMP approaches, and discusses the main techniques to achieve load balance for both intra-node and inter-node training. Furthermore, the main techniques to optimize computation, storage, and communication are presented, with potential research directions being discussed.

Open Access Issue

Decoupled Two-Phase Framework for Class-Incremental Few-Shot Named Entity Recognition

Yifan Chen, Zhen Huang, Minghao Hu, Dongsheng Li, Changjian Wang, Feng Liu, Xicheng Lu

Tsinghua Science and Technology 2023, 28(5): 976-987

Published: 19 May 2023

Abstract

Class-Incremental Few-Shot Named Entity Recognition (CIFNER) aims to identify entity categories that have appeared with only a few newly added (novel) class examples. However, existing class-incremental methods typically introduce new parameters to adapt to new classes and treat all information equally, resulting in poor generalization. Meanwhile, few-shot methods require samples for all observed classes, making them difficult to transfer to a class-incremental setting. Thus, a decoupled two-phase framework for the CIFNER task is proposed to address these issues. The whole task is converted into two separate tasks, Entity Span Detection (ESD) and Entity Class Discrimination (ECD), which leverage parameter cloning and label fusion to learn different levels of knowledge separately, such as class-generic and class-specific knowledge. Moreover, different variants, such as Conditional Random Field-based (CRF-based) and word-pair-based methods in the ESD module, and add-based, Natural Language Inference-based (NLI-based), and prompt-based methods in the ECD module, are investigated to demonstrate the generalizability of the decoupled framework. Extensive experiments on three Named Entity Recognition (NER) datasets show that our method achieves state-of-the-art performance in the CIFNER setting.

Open Access Issue

Efficient Knowledge Graph Embedding Training Framework with Multiple GPUs

Ding Sun, Zhen Huang, Dongsheng Li, Min Guo

Tsinghua Science and Technology 2023, 28(1): 167-175

Published: 21 July 2022

Abstract

When training a large-scale knowledge graph embedding (KGE) model with multiple graphics processing units (GPUs), a partition-based method is necessary for parallel training. However, existing partition-based training methods suffer from low GPU utilization and high input/output (IO) overhead between memory and disk. To address the high IO overhead, we optimize twice partitioning with fine-grained GPU scheduling, which reduces the IO overhead between CPU memory and disk. To address the low GPU utilization caused by GPU load imbalance, we propose balanced partitioning and dynamic scheduling methods that accelerate training in different cases. Building on these methods, we propose fine-grained partitioning KGE, an efficient KGE training framework for multiple GPUs. We conducted experiments on several knowledge graph benchmarks, and the results show that our method achieves a speedup over existing frameworks for KGE training.

Open Access Issue

Improved Heuristic Job Scheduling Method to Enhance Throughput for Big Data Analytics

Zhiyao Hu, Dongsheng Li

Tsinghua Science and Technology 2022, 27(2): 344-357

Published: 29 September 2021

Abstract

Data-parallel computing platforms, such as Hadoop and Spark, are deployed in computing clusters for big data analytics. Multiple users increasingly share the same computing cluster, so scheduling multiple jobs becomes a serious challenge. The Shortest-Job-First (SJF) method has long been considered the optimal solution for minimizing the average job completion time. However, the SJF method leads to low system throughput when a small number of short jobs consume a large amount of resources, which in turn prolongs the average job completion time. We propose an improved heuristic job scheduling method, called the Densest-Job-Set-First (DJSF) method. The DJSF method schedules jobs by maximizing the number of completed jobs per unit time, aiming to decrease the average Job Completion Time (JCT) and improve the system throughput. We perform extensive simulations based on Google cluster data. Compared with the SJF method, the DJSF method decreases the average JCT by 23.19% and enhances the system throughput by 42.19%. Compared with Tetris, the job packing method improves the job completion efficiency by 55.4%, so that the computing platforms complete more jobs in a short time span.
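
The SJF baseline mentioned above can be illustrated with a minimal sketch. It assumes a deliberately simplified model in which jobs run one after another on a single resource, which is exactly the setting where SJF minimizes the average JCT; it ignores per-job resource demands and does not implement DJSF. The job durations are hypothetical.

```python
from itertools import accumulate


def average_jct(durations):
    """Average job completion time when jobs run back to back in the given order."""
    completion_times = list(accumulate(durations))
    return sum(completion_times) / len(completion_times)


jobs = [8, 1, 3, 2]  # hypothetical job durations, in arrival order

fifo_jct = average_jct(jobs)         # run in arrival order
sjf_jct = average_jct(sorted(jobs))  # Shortest-Job-First order

print(f"FIFO average JCT: {fifo_jct}")
print(f"SJF average JCT:  {sjf_jct}")
```

Because this model has no notion of how many resources each job occupies, it cannot show the throughput loss the abstract attributes to SJF; capturing that effect would require a multi-resource cluster model.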

Open Access Issue

Increasing Momentum-Like Factors: A Method for Reducing Training Errors on Multiple GPUs

Yu Tang, Zhigang Kan, Lujia Yin, Zhiquan Lai, Zhaoning Zhang, Linbo Qiao, Dongsheng Li

Tsinghua Science and Technology 2022, 27(1): 114-126

Published: 17 August 2021

Abstract

In distributed training, increasing batch size can improve parallelism, but it can also bring many difficulties to the training process and cause training errors. In this work, we investigate the occurrence of training errors in theory and train ResNet-50 on CIFAR-10 by using Stochastic Gradient Descent (SGD) and Adaptive moment estimation (Adam) while keeping the total batch size in the parameter server constant and lowering the batch size on each Graphics Processing Unit (GPU). A new method that considers momentum to eliminate training errors in distributed training is proposed. We define a Momentum-like Factor (MF) to represent the influence of former gradients on parameter updates in each iteration. Then, we modify the MF values and conduct experiments to explore how different MF values influence the training performance based on SGD, Adam, and Nesterov accelerated gradient. Experimental results reveal that increasing MFs is a reliable method for reducing training errors in distributed training. The analysis of convergent conditions in distributed training with consideration of a large batch size and multiple GPUs is presented in this paper.
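
The abstract does not give the exact definition of the Momentum-like Factor, so the sketch below uses the classical momentum coefficient of heavy-ball SGD as a stand-in. It is meant only to show how such a factor controls the influence of former gradients on each update; the function name, objective, and hyperparameter values are illustrative assumptions.

```python
def momentum_sgd(grad, x0, lr, mu, steps):
    """Heavy-ball SGD: the velocity v accumulates past gradients with weight mu."""
    x, v = x0, 0.0
    for _ in range(steps):
        v = mu * v + grad(x)  # a gradient from k steps ago is weighted by mu**k
        x = x - lr * v
    return x


def quadratic_grad(x):
    """Gradient of the toy objective f(x) = 0.5 * x**2."""
    return x


# Compare how far the iterate has moved toward the minimum at 0 after a few steps.
for mu in (0.0, 0.5, 0.9):
    x_final = momentum_sgd(quadratic_grad, x0=1.0, lr=0.1, mu=mu, steps=5)
    print(f"mu={mu}: x after 5 steps = {x_final}")
```

With mu set to 0 the update reduces to plain gradient descent, while larger values let earlier gradients keep contributing to later steps.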

Open Access Issue

SIGNGD with Error Feedback Meets Lazily Aggregated Technique: Communication-Efficient Algorithms for Distributed Learning

Xiaoge Deng, Tao Sun, Feng Liu, Dongsheng Li

Tsinghua Science and Technology 2022, 27(1): 174-185

Published: 17 August 2021

Abstract

The proliferation of massive datasets has led to significant interest in distributed algorithms for solving large-scale machine learning problems. However, the communication overhead is a major bottleneck that hampers the scalability of distributed machine learning systems. In this paper, we design two communication-efficient algorithms for distributed learning tasks. The first one is named EF-SIGNGD, in which we use the 1-bit (sign-based) gradient quantization method to save the communication bits. Moreover, the error feedback technique, i.e., incorporating the error made by the compression operator into the next step, is employed for the convergence guarantee. The second algorithm is called LE-SIGNGD, in which we introduce a well-designed lazy gradient aggregation rule to EF-SIGNGD that can detect the gradients with small changes and reuse the outdated information. LE-SIGNGD saves communication costs both in transmitted bits and communication rounds. Furthermore, we show that LE-SIGNGD is convergent under some mild assumptions. The effectiveness of the two proposed algorithms is demonstrated through experiments on both real and synthetic data.
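
The error feedback idea described above can be sketched for a single worker as follows. This follows a commonly used formulation of sign compression with error feedback (signs scaled by the mean absolute value, with the compression residual carried to the next step); it is an assumption that the paper's EF-SIGNGD matches this form, and the lazy aggregation rule of LE-SIGNGD is not reproduced. All function names are hypothetical.

```python
def sign(value):
    """Sign of a scalar, mapping 0 to +1 so every entry costs exactly one bit."""
    return 1.0 if value >= 0 else -1.0


def compress(p):
    """1-bit compression: keep only the signs, scaled by the mean absolute value."""
    scale = sum(abs(v) for v in p) / len(p)
    return [scale * sign(v) for v in p]


def ef_sign_step(x, grad, error, lr):
    """One update: add the stored residual, compress, apply, and store the new residual."""
    p = [lr * g + e for g, e in zip(grad, error)]      # corrected step
    delta = compress(p)                                # what would be transmitted
    new_x = [xi - di for xi, di in zip(x, delta)]      # apply the compressed step
    new_error = [pi - di for pi, di in zip(p, delta)]  # what compression lost
    return new_x, new_error


x, error = [0.0, 0.0], [0.0, 0.0]
x, error = ef_sign_step(x, grad=[3.0, -1.0], error=error, lr=1.0)
print("parameters:", x)
print("residual carried to next step:", error)
```

The residual keeps the information that the 1-bit compressor discarded, so it is not lost but re-enters the update at the next iteration.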

Open Access Issue

Balance Resource Allocation for Spark Jobs Based on Prediction of the Optimal Resource

Zhiyao Hu, Dongsheng Li, Deke Guo

Tsinghua Science and Technology 2020, 25(4): 487-497

Published: 13 January 2020

Abstract

Apache Spark provides a well-known MapReduce computing framework that aims to speed up big data analytics in a data-parallel manner. With this platform, large input data are divided into data partitions. Each data partition is processed by multiple computation tasks concurrently. Outputs of these computation tasks are transferred among multiple computers via the network. However, such a distributed computing framework suffers from system overheads, inevitably caused by communication and disk I/O operations. System overheads take up a large proportion of the Job Completion Time (JCT). We observed that excessive computational resources incur considerable system overheads, prolonging the JCT. The over-allocation of individual jobs not only prolongs their own JCTs, but also likely makes other jobs suffer from under-allocation. Thus, the average JCT is also suboptimal. To address this problem, we propose a prediction model to estimate the changing JCT of a single Spark job. With the support of the prediction method, we designed a heuristic algorithm to balance the resource allocation of multiple Spark jobs, aiming to minimize the average JCT in multiple-job cases. We implemented the prediction model and resource allocation method in ReB, a Resource-Balancer based on Apache Spark. Experimental results showed that ReB significantly outperformed the traditional max-min fairness and shortest-job-optimal methods. The average JCT was decreased by around 10%-30% compared to the existing solutions.