Open Access Issue
An MPI+OpenACC-Based PRM Scalar Advection Scheme in the GRAPES Model over a Cluster with Multiple CPUs and GPUs
Tsinghua Science and Technology 2022, 27 (1): 164-173
Published: 17 August 2021
Downloads:52

A moisture advection scheme is an essential module of a numerical weather/climate model, representing the horizontal transport of water vapor. The Piecewise Rational Method (PRM) scalar advection scheme in the Global/Regional Assimilation and Prediction System (GRAPES) solves the moisture flux advection equation based on PRM. Computing the scalar advection involves boundary exchange and has high bandwidth requirements, which makes it complicated and time-consuming in GRAPES. Recently, Graphics Processing Units (GPUs) have been widely used to solve scientific and engineering computing problems owing to advancements in GPU hardware and related programming models such as CUDA/OpenCL and Open Accelerator (OpenACC). Herein, we present an accelerated PRM scalar advection scheme with Message Passing Interface (MPI) and OpenACC to fully exploit GPUs’ power over a cluster with multiple Central Processing Units (CPUs) and GPUs, together with optimizations such as minimizing data transfer, memory coalescing, exposing more parallelism, and overlapping computation with data transfers. Results show a speedup of about 3.5 times for the entire model running at medium resolution with double precision, comparing the scheme’s elapsed time on a node with two GPUs (NVIDIA P100) against two 16-core CPUs (Intel Gold 6142). Further, results from experiments with a higher-resolution model on multiple GPUs show excellent scalability.

Open Access Issue
Helmholtz Solving and Performance Optimization in Global/Regional Assimilation and Prediction System
Tsinghua Science and Technology 2021, 26 (3): 335-346
Published: 12 October 2020
Downloads:48

Despite efficient parallelism in the solution of physical parameterization in the Global/Regional Assimilation and Prediction System (GRAPES), the Helmholtz equation in the dynamic core can hardly achieve sufficient parallelism in the solving process as resolution increases, due to a large amount of communication and irregular access. Optimizing the Helmholtz equation solution for better performance and higher efficiency has therefore become an urgent task. This paper proposes an optimization scheme for the parallel solution of the Helmholtz equation. Specifically, the geometrical multigrid optimization strategy is designed by taking advantage of the data anisotropy of grid points near the pole and the isotropy of those near the equator in the Helmholtz equation, and the Incomplete LU (ILU) decomposition preconditioner is adopted to speed up the convergence of the improved Generalized Conjugate Residual (GCR), which effectively reduces the number of iterations and the computation time. The overall solving performance of the Helmholtz equation is improved by thread-level parallelism, vectorization, and reuse of data in the cache. The experimental results show that the proposed optimization scheme can effectively eliminate the solving-speed bottleneck of the Helmholtz equation. In tests on a 10-node two-way server, the solution of the Helmholtz equation is accelerated by 100× compared with the original serial version, with the number of iterations reduced by one-third.

Regular Paper Issue
Lessons Learned from Optimizing the Sunway Storage System for Higher Application I/O Performance
Journal of Computer Science and Technology 2020, 35 (1): 47-60
Published: 17 January 2020

It is hard for applications to fully utilize the peak bandwidth of the storage system in high-performance computers because of I/O interferences, storage resource misallocations, and complex long I/O paths. We performed several studies to bridge this gap in the Sunway storage system, which serves the supercomputer Sunway TaihuLight. To locate these issues and the connections between them, an end-to-end performance monitoring and diagnosis tool was developed to understand the I/O behaviors of applications and the system. With the help of the tool, we were able to find the root causes of such performance barriers at the I/O forwarding layer and the parallel file system (PFS) layer. An application-aware I/O forwarding allocation framework was used to address the I/O interferences and resource misallocations at the I/O forwarding layer. A performance-aware data placement mechanism was proposed to mitigate the impact of I/O interferences and performance variations of storage devices in the PFS. Together, these allowed applications to obtain much better I/O performance. During the process, we also proposed a lightweight storage stack to shorten the I/O path of applications with the N-N I/O pattern. This paper summarizes these studies and presents the lessons learned from the process.
