Volume 25, Issue 1

Heterogeneous Parallel Algorithm Design and Performance Optimization for WENO on the Sunway TaihuLight Supercomputer

Jianqiang Huang, Wentao Han, Xiaoying Wang, Wenguang Chen
State Key Laboratory of Plateau Ecology and Agriculture, Department of Computer Technology and Applications, Qinghai University, Xining 810016, China.
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.

Abstract

The Weighted Essentially Non-Oscillatory (WENO) scheme is a numerical method for solving hyperbolic conservation laws, and it is well suited to simulating high-density fluid interface instabilities with strong intermittency. Such problems have large and complex flow structures. Fully utilizing the computing power of a High Performance Computing (HPC) system requires methodologies that optimize application performance for that system’s particular architecture. At the time of writing, the Sunway TaihuLight supercomputer is ranked as the fastest in the world. This article presents the design and performance optimization of a heterogeneous parallel algorithm for a high-order WENO scheme on Sunway TaihuLight. We analyzed the characteristics of the kernel functions and proposed an appropriate heterogeneous parallel model. We also determined the best strategy for dividing the computing tasks and implemented the parallel algorithm on Sunway TaihuLight. By applying access optimization, data dependency elimination, and vectorization, our parallel algorithm achieves up to a 172× speedup on a single node and an additional 58× speedup on 64 nodes, with nearly linear scalability.

Keywords: Sunway TaihuLight, optimization, parallel algorithms, Weighted Essentially Non-Oscillatory scheme (WENO), many-core
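
To make the kind of kernel named in the abstract concrete, the following is a minimal sketch of a WENO reconstruction in plain Python. It assumes the classic fifth-order formulation of Jiang and Shu (J. Comput. Phys., 1996); the abstract says only "high-order WENO" and does not state which order or variant the authors use. The function names `weno5_reconstruct` and `reconstruct_interfaces` and the constant `EPS` are hypothetical, chosen for illustration, and this is not the authors' Sunway TaihuLight code.

```python
EPS = 1e-6  # assumed small constant that keeps the weight denominators nonzero


def weno5_reconstruct(vm2, vm1, v0, vp1, vp2, eps=EPS):
    """Left-biased fifth-order WENO value at the interface x_{i+1/2}.

    The arguments are the cell averages v[i-2], v[i-1], v[i], v[i+1], v[i+2].
    """
    # Third-order candidate reconstructions on the three sub-stencils.
    p0 = (2.0 * vm2 - 7.0 * vm1 + 11.0 * v0) / 6.0
    p1 = (-vm1 + 5.0 * v0 + 2.0 * vp1) / 6.0
    p2 = (2.0 * v0 + 5.0 * vp1 - vp2) / 6.0

    # Smoothness indicators: large where a sub-stencil crosses a steep gradient.
    b0 = 13.0 / 12.0 * (vm2 - 2.0 * vm1 + v0) ** 2 + 0.25 * (vm2 - 4.0 * vm1 + 3.0 * v0) ** 2
    b1 = 13.0 / 12.0 * (vm1 - 2.0 * v0 + vp1) ** 2 + 0.25 * (vm1 - vp1) ** 2
    b2 = 13.0 / 12.0 * (v0 - 2.0 * vp1 + vp2) ** 2 + 0.25 * (3.0 * v0 - 4.0 * vp1 + vp2) ** 2

    # Unnormalized nonlinear weights built from the linear weights 1/10, 6/10, 3/10.
    a0 = 0.1 / (eps + b0) ** 2
    a1 = 0.6 / (eps + b1) ** 2
    a2 = 0.3 / (eps + b2) ** 2

    # Normalized convex combination of the three candidates.
    return (a0 * p0 + a1 * p1 + a2 * p2) / (a0 + a1 + a2)


def reconstruct_interfaces(v):
    """Apply the reconstruction at every interior cell that has a full stencil."""
    return [weno5_reconstruct(*v[i - 2:i + 3]) for i in range(2, len(v) - 2)]


if __name__ == "__main__":
    # Smooth (linear) data: the weights stay close to the linear weights.
    print(reconstruct_interfaces([0.0, 1.0, 2.0, 3.0, 4.0, 5.0]))
    # Step data: the sub-stencils that cross the jump are weighted down.
    print(weno5_reconstruct(0.0, 0.0, 0.0, 1.0, 1.0))
```

In this sketch each interface value depends only on five neighboring cell averages, so the loop in `reconstruct_interfaces` has no dependency between iterations. That independence is the general property that makes stencil kernels of this kind candidates for task division across cores and for vectorization; how the authors actually partition and vectorize the computation is described in the full text, not here.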

Publication history

Received: 17 April 2018
Revised: 07 July 2018
Accepted: 09 July 2018
Published: 22 July 2019
Issue date: February 2020

Copyright

© The author(s) 2020

Acknowledgements

This work was partially supported by the National High-Tech Research and Development (863) Program of China (No. 2015AA015306), the Science and Technology Plan of Beijing Municipality (No. Z161100000216147), the National Natural Science Foundation of China (No. 61762074), the Youth Foundation Program of Qinghai University (No. 2016-QGY-5), the National Natural Science Foundation of Qinghai Province (No. 2019-ZJ-7034), and the National Supercomputer Center in Wuxi, China.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
