Scholar - SciOpen

Large language models (LLMs) have achieved remarkable progress in natural language processing, but their immense scale leads to significant computational and storage overheads, limiting their deployment and widespread application in resource-constrained environments. Model quantization, an effective model compression technique, significantly reduces LLMs' memory footprint and computational requirements by lowering the numerical precision of model parameters and/or activations, while striving to keep performance loss minimal. This survey comprehensively reviews the latest advancements in LLM quantization, covering techniques from the pretraining phase to the inference phase. We delve into state-of-the-art quantization during pretraining, post-training quantization and quantization-aware training during fine-tuning, and various quantization methods during inference. Through in-depth analysis of these methods, this survey seeks to give researchers and engineers a comprehensive understanding of LLM quantization techniques, to identify future research directions, and to offer insight into how to generate high-performance low-precision kernels on different chips.
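
As an illustration of what "lowering the numerical precision of model parameters" means in the simplest case, the following is a minimal sketch of symmetric per-tensor 8-bit quantization. It is not taken from the survey: the function names, the choice of a single scale per tensor, and the sample weights are all assumptions made for this example.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization (illustrative): floats -> int8 codes."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; 127 is the largest int8 magnitude used.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
weights = [0.5, -1.27, 0.031, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of four (relative to 32-bit floats), and the rounding error per weight is bounded by half the scale. Practical schemes typically use finer granularity (per channel or per group) to shrink that error.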

Open Access Issue

A Hybrid Platform for Multi-Neurons Model with Optimum Co-Design Method

Song Wang, Keao Qiao, Jindong Liang, Nianchen Hua, Rui Lu, Jidong Zhai, Jing Pei

Big Data Mining and Analytics 2026, 9(2): 596-610

Published: 09 February 2026

Abstract

Hybrid neuromorphic computing, integrating Artificial Neural Networks (ANNs) and Spiking Neural Networks (SNNs), is a key approach to advancing Artificial General Intelligence (AGI). Current hybrid platforms are limited to Leaky Integrate-and-Fire (LIF) based SNNs, missing crucial biological neuron behaviors like bursting and adaptation. We propose a hybrid platform based on the TianjicX chip, enabling heterogeneous integration of multiple SNN models (LIF, Quadratic Integrate-and-Fire (QIF), and Izhikevich) alongside ANNs. Our platform employs a co-design strategy for computing and storage mechanisms, minimizing data movement. Simulations show that the co-design approach reduces energy consumption by 8.11% (48.67 mW) compared to TianjicX. The platform also demonstrates superior computational performance across SNN models. It achieves 95% classification accuracy on the MNIST dataset (3000 images, each being 28 pixel×28 pixel and single presentation), surpassing Open Date Index Name (ODIN) by 10.5%. This is achieved with a two-layer fully-connected Izhikevich network (784×800×10), where each synapse operates at 8-bit precision. The network processes 33900 images per second, using only 35 cores (21.88% of 160 cores) and delivering 896 billion operations per second. Furthermore, on ResNet-50, our platform shows a 3.12% increase in computing speed and 40.85 mW/frame reduction in energy consumption compared to the TianjicX chip.
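
For readers unfamiliar with the Izhikevich neuron named in the abstract, the sketch below simulates a single neuron using the commonly published form of the model, dv/dt = 0.04v² + 5v + 140 − u + I and du/dt = a(bv − u), with a reset v ← c, u ← u + d when v reaches 30 mV. The parameter values (the usual "regular spiking" set), the forward-Euler step of 0.5 ms, and the function name are assumptions for illustration; the abstract does not state how the platform integrates or parameterizes the model.

```python
def simulate_izhikevich(current, steps=2000, dt=0.5,
                        a=0.02, b=0.2, c=-65.0, d=8.0):
    """Forward-Euler simulation of one Izhikevich neuron (illustrative).

    Returns the spike times in milliseconds for a constant input current.
    """
    v = c          # membrane potential (mV)
    u = b * v      # recovery variable
    spike_times = []
    for step in range(steps):
        if v >= 30.0:
            # Spike: record it, reset the potential, bump the recovery variable.
            spike_times.append(step * dt)
            v = c
            u += d
        dv = 0.04 * v * v + 5.0 * v + 140.0 - u + current
        du = a * (b * v - u)
        v += dt * dv
        u += dt * du
    return spike_times


spikes_driven = simulate_izhikevich(current=10.0)
spikes_silent = simulate_izhikevich(current=0.0)
```

With no input the neuron settles at its resting potential and never fires; with a constant drive it fires repeatedly, and the growing recovery variable `u` produces the adaptation behavior that plain LIF neurons lack.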

Open Access Review Issue

A Survey on Accelerated Technologies for Mixture-of-Experts Model Training Systems

Qi Zhang, Jidong Zhai, Weimin Zheng

Tsinghua Science and Technology 2026, 31(3): 1411-1439

Published: 19 December 2025

Abstract

Mixture-of-Experts (MoE) models have emerged as a transformative paradigm for scaling Large Language Models (LLMs), enabling unprecedented model capacity while maintaining computational efficiency through sparse activation mechanisms. However, the unique architectural characteristics of MoE models introduce significant system-level challenges that fundamentally differ from those of traditional dense models. These challenges necessitate specialized system optimizations tailored to MoE’s distinctive properties. This survey systematically analyzes accelerated technologies for MoE training systems, discussing recent advances across four critical optimization dimensions: hybrid parallel computing, comprehensive memory management, fine-grained communication scheduling, and adaptive load balancing. Our analysis reveals a paradigm shift from computation-centric to workload-centric optimization strategies. Furthermore, we identify emerging research directions including machine learning-guided load balancing, cross-layer optimization frameworks, and hardware-software co-design for MoE training workloads. This work aims to provide researchers and system engineers with a comprehensive technical reference to support the design of more efficient and scalable next-generation MoE training systems.
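
To make "sparse activation" and the resulting load-balancing problem concrete, here is a minimal sketch of top-k expert routing. It is a generic illustration rather than a method from the survey: the function names, the choice of k = 2, and the gate logits are hypothetical.

```python
import math


def top_k_route(logits, k=2):
    """Pick the k experts with the largest gate logits (illustrative).

    Returns (expert_index, weight) pairs; weights are a softmax over the
    selected logits only, so they sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]  # shift for stability
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def expert_load(batch_logits, num_experts, k=2):
    """Count how many tokens each expert receives for a batch of tokens."""
    load = [0] * num_experts
    for logits in batch_logits:
        for expert, _ in top_k_route(logits, k):
            load[expert] += 1
    return load


# Hypothetical gate logits for three tokens over four experts.
batch = [
    [2.0, 1.0, 0.1, -1.0],
    [1.5, 1.4, 0.0, -2.0],
    [3.0, -1.0, 2.5, 0.0],
]
load = expert_load(batch, num_experts=4)
route_weights = [w for _, w in top_k_route(batch[0])]
```

Only two of the four experts run per token, which is where the computational savings come from. In this toy batch, expert 0 receives every token while expert 3 receives none, which is the kind of skew that load-balancing techniques aim to correct.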

Open Access Issue

Accelerating Distributed Training of Large Concurrent-Branch Models through Bidirectional Pipeline Coordination

Zan Zong, Yuyang Chen, Qi Zhang, Daming Zhao, Jianjiang Li, Yijun Jing, Jidong Zhai

Tsinghua Science and Technology 2025, 30(6): 2638-2652

Published: 04 July 2025

Abstract

Large models have been widely used in natural language processing, information retrieval, and other fields. As large models develop, not only has the parameter scale increased, but the model architecture has also become more complex. For example, multi-modal transformer-based models mainly consist of concurrent branches; we denote such a model as a concurrent branch model (CBM). Many CBMs have grown to tens of billions of parameters and require distributed resources to train. Existing distributed training systems cannot fully handle this type of model architecture because there are interactions between branches. Inspired by the unbalanced resource usage of pipeline parallelism, we organize different branches with a fine-grained bidirectional pipeline schedule of communication and computation. However, improper coordination between branches leads to idle computation time and low training efficiency. In this paper, we present Flexpipe, a pipeline engine for concurrent-branch models. We first introduce branch-aware pipeline parallelism (BAPP) to make full use of the concurrent characteristic of the model architecture. Then, based on a multi-branch pipeline simulator, we propose an adaptive interaction coordinator, which facilitates low-overhead branch interactions during distributed model training. We evaluate our approach on popular concurrent branch models combined with modern training systems. Compared with Chimera, the experimental results show that our method improves end-to-end training throughput by 20% on average.

Open Access Issue

Training Large Models on Heterogeneous and Geo-Distributed Resource with Constricted Networks

Zan Zong, Minkun Guo, Mingshu Zhai, Yinan Tang, Jianjiang Li, Jidong Zhai

Big Data Mining and Analytics 2025, 8(4): 966-980

Published: 12 May 2025

Abstract

As the computational demands driven by large model technologies continue to grow rapidly, leveraging GPU hardware to expedite parallel training has become a commonly used strategy. When computational resources within a single cluster are insufficient for large-model training, the hybrid utilization of heterogeneous acceleration hardware has emerged as a promising technical solution. The utilization of heterogeneous acceleration hardware and the scheduling of diverse cloud resources have become a focal point of considerable interest. However, these computing resources are often geographically distributed. Lacking awareness of heterogeneous devices and network topologies, existing parallel training frameworks struggle to leverage mixed GPU resources effectively across constrained networks. To boost the computing capability of connected heterogeneous clusters, we propose HGTrainer, an optimizer designed to plan heterogeneous parallel strategies across distributed clusters for large model training. HGTrainer can adaptively saturate heterogeneous clusters by expanding the tunable parallelism space for heterogeneous accelerators while remaining aware of the relatively lower inter-cluster bandwidth. To achieve this goal, we formulate the model partitioning problem among heterogeneous hardware and introduce a hierarchical search algorithm to solve the optimization problem. In addition, a mixed-precision pipeline method is used to reduce the cost of inter-cluster communication. We evaluate HGTrainer on connected heterogeneous clusters with popular large language models. The experimental results show that HGTrainer improves training throughput by 1.49× on average on the mixed heterogeneous cluster compared with the state-of-the-art Metis.

Perspective Issue

Unified Programming Models for Heterogeneous High-Performance Computers

Zi-Xuan Ma, Yu-Yang Jin, Shi-Zhi Tang, Hao-Jie Wang, Wei-Cheng Xue, Ji-Dong Zhai, Wei-Min Zheng

Journal of Computer Science and Technology 2023, 38(1): 211-218

Published: 28 February 2023

Abstract

Unified programming models can effectively improve program portability on various heterogeneous high-performance computers. Existing unified programming models put a lot of effort into code portability but are still far from achieving good performance portability. In this paper, we present a preliminary design of a performance-portable unified programming model comprising four aspects: programming language, programming abstraction, compilation optimization, and scheduling system. Specifically, domain-specific languages introduce domain knowledge to decouple the optimizations for different applications and architectures. The unified programming abstraction unifies the common features of different architectures to support common optimizations. Multi-level compilation optimization enables comprehensive performance optimization based on multi-level intermediate representations. A resource-aware lightweight runtime scheduling system improves the resource utilization of heterogeneous computers. This is a perspective paper presenting our viewpoints on programming models for emerging heterogeneous systems.

Issue

Efficient memory allocator for the New Generation Sunway supercomputer

Haojie WANG, Zixuan MA, Liyan ZHENG, Yuanwei WANG, Fei WANG, Jidong ZHAI

Journal of Tsinghua University (Science and Technology) 2022, 62(5): 943-951

Published: 15 May 2022

Abstract

Supercomputers provide enormous computing power for large applications. Traditional supercomputers have mainly targeted scientific computing problems. However, other applications impose new requirements on both supercomputer software and hardware designs. The New Generation Sunway supercomputer has an inefficient memory allocator when running in the dynamic mode. This study develops an efficient memory allocator, SWAlloc, that reduces the memory allocation time of the brain-scale pretrained model training framework, BaGuaLu, by up to 75 839 times. Evaluations using PARSEC also show that SWAlloc can speed up memory allocation by up to 51 times (36% on average). SWAlloc has been deployed on the New Generation Sunway supercomputer for use by various large applications, including SWPytorch and SWTensorFlow.

Open Access Issue

AIPerf: Automated Machine Learning as an AI-HPC Benchmark

Zhixiang Ren, Yongheng Liu, Tianhui Shi, Lei Xie, Yue Zhou, Jidong Zhai, Youhui Zhang, Yunquan Zhang, Wenguang Chen

Big Data Mining and Analytics 2021, 4(3): 208-220

Published: 12 May 2021

Abstract

The plethora of complex Artificial Intelligence (AI) algorithms and available High-Performance Computing (HPC) power stimulates the expeditious development of AI components with heterogeneous designs. Consequently, the need for cross-stack performance benchmarking of AI-HPC systems has rapidly emerged. In particular, the de facto HPC benchmark, LINPACK, cannot reflect the AI computing power and input/output performance without a representative workload. Current popular AI benchmarks, such as MLPerf, have a fixed problem size and therefore limited scalability. To address these issues, we propose an end-to-end benchmark suite utilizing automated machine learning, which not only represents real AI scenarios, but also is auto-adaptively scalable to various scales of machines. We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems with customizable configurations. We utilize Operations Per Second (OPS), which is measured in an analytical and systematic approach, as a major metric to quantify the AI performance. We perform evaluations on various systems to ensure the benchmark’s stability and scalability, from 4 nodes with 32 NVIDIA Tesla T4 (56.1 Tera-OPS measured) up to 512 nodes with 4096 Huawei Ascend 910 (194.53 Peta-OPS measured), and the results show near-linear weak scalability. With a flexible workload and single metric, AIPerf can easily scale on and rank AI-HPC, providing a powerful benchmark suite for the coming supercomputing era.
