Journal Home > Volume 27, Issue 1

An MPI+OpenACC-Based PRM Scalar Advection Scheme in the GRAPES Model over a Cluster with Multiple CPUs and GPUs

Huadong Xiao, Yang Lu, Jianqiang Huang, and Wei Xue
Institute of Geodesy and Geophysics, Chinese Academy of Sciences, Wuhan 430074, China
University of Chinese Academy of Sciences, Beijing 100049, China
National Meteorological Information Center, Beijing 100081, China
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Department of Computer Technology and Application, Qinghai University, Xining 810016, China

Abstract

A moisture advection scheme is an essential module of a numerical weather/climate model, representing the horizontal transport of water vapor. The Piecewise Rational Method (PRM) scalar advection scheme in the Global/Regional Assimilation and Prediction System (GRAPES) solves the moisture flux advection equation based on PRM. Computation of the scalar advection involves boundary exchange, and its high memory-bandwidth requirements make it complicated and time-consuming in GRAPES. Recently, Graphics Processing Units (GPUs) have been widely used to solve scientific and engineering computing problems, owing to advancements in GPU hardware and related programming models such as CUDA/OpenCL and Open Accelerator (OpenACC). Herein, we present a PRM scalar advection scheme accelerated with the Message Passing Interface (MPI) and OpenACC to fully exploit the power of GPUs over a cluster with multiple Central Processing Units (CPUs) and GPUs, together with various optimizations such as minimizing data transfer, coalescing memory accesses, exposing more parallelism, and overlapping computation with data transfers. Results show that a speedup of about 3.5 times is obtained for the entire model running at medium resolution with double precision, comparing the scheme's elapsed time on two GPUs (NVIDIA P100) with that on two 16-core CPUs (Intel Gold 6142) within a node. Furthermore, experiments with a higher-resolution model on multiple GPUs show excellent scalability.

Keywords: Global/Regional Assimilation and Prediction System (GRAPES), Message Passing Interface (MPI), Graphics Processing Unit (GPU) computing, Open Accelerator (OpenACC), Piecewise Rational Method (PRM) scalar advection scheme


Publication history

Received: 30 July 2020
Accepted: 18 August 2020
Published: 17 August 2021
Issue date: February 2022

Copyright

© The author(s) 2022

Acknowledgements

This work was partially supported by the decision support project of response to climate change of China, the National Natural Science Foundation of China (Nos. 41674085, 41604009, and 41621091), the Natural Science Foundation of Qinghai Province (No. 2019-ZJ-7034), and the Open Project of State Key Laboratory of Plateau Ecology and Agriculture, Qinghai University (No. 2020-zz-03).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
