Heterogeneous Parallel Algorithm Design and Performance Optimization for WENO on the Sunway TaihuLight Supercomputer

Jianqiang Huang; Wentao Han; Xiaoying Wang; Wenguang Chen

doi:10.26599/TST.2018.9010112

Tsinghua Science and Technology 2020, 25(1): 56-67 https://doi.org/10.26599/TST.2018.9010112

Open Access | Issue | Published: 22 July 2019

State Key Laboratory of Plateau Ecology and Agriculture, Department of Computer Technology and Applications, Qinghai University, Xining 810016, China.

Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China.

Keywords:

Sunway TaihuLight, optimization, parallel algorithms, Weighted Essentially Non-Oscillatory scheme (WENO), many-core

Cite this article:

Huang J, Han W, Wang X, et al. Heterogeneous Parallel Algorithm Design and Performance Optimization for WENO on the Sunway TaihuLight Supercomputer. Tsinghua Science and Technology, 2020, 25(1): 56-67. https://doi.org/10.26599/TST.2018.9010112

Abstract

The Weighted Essentially Non-Oscillatory (WENO) scheme is a numerical method for solving hyperbolic conservation laws, and it is well suited to simulating high-density fluid interface instabilities with strong intermittency. These problems have large and complex flow structures. To fully utilize the computing power of High Performance Computing (HPC) systems, it is necessary to develop specific methodologies that optimize application performance for the particular system’s architecture. At the time of writing, the Sunway TaihuLight supercomputer was ranked as the fastest supercomputer in the world. This article presents the heterogeneous parallel algorithm design and performance optimization of a high-order WENO scheme on Sunway TaihuLight. We analyzed the characteristics of the kernel functions and proposed an appropriate heterogeneous parallel model. We also determined the best division strategy for computing tasks and implemented the parallel algorithm on Sunway TaihuLight. By using access optimization, data dependency elimination, and vectorization optimization, our parallel algorithm achieves up to 172× speedup on a single node and an additional 58× speedup on 64 nodes, with nearly linear scalability.
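For readers unfamiliar with WENO, the following is a minimal sketch of the kind of per-cell kernel such a scheme evaluates. It assumes the classic fifth-order finite-difference variant with the standard smoothness indicators and linear weights (0.1, 0.6, 0.3); the abstract only says "high-order", so the order, the function name `weno5_left`, and the parameter `eps` are illustrative assumptions and not the authors' Sunway implementation.

```python
import numpy as np


def weno5_left(v, eps=1e-6):
    """Left-biased fifth-order WENO reconstruction at the i+1/2 interface.

    v holds the five stencil values [v[i-2], v[i-1], v[i], v[i+1], v[i+2]].
    eps is a small constant (assumed value) that avoids division by zero.
    """
    vm2, vm1, v0, vp1, vp2 = (float(x) for x in v)

    # Three third-order candidate reconstructions, one per sub-stencil.
    p = np.array([
        (2.0 * vm2 - 7.0 * vm1 + 11.0 * v0) / 6.0,
        (-vm1 + 5.0 * v0 + 2.0 * vp1) / 6.0,
        (2.0 * v0 + 5.0 * vp1 - vp2) / 6.0,
    ])

    # Smoothness indicators: large where a sub-stencil crosses a discontinuity.
    beta = np.array([
        13.0 / 12.0 * (vm2 - 2.0 * vm1 + v0) ** 2
        + 0.25 * (vm2 - 4.0 * vm1 + 3.0 * v0) ** 2,
        13.0 / 12.0 * (vm1 - 2.0 * v0 + vp1) ** 2
        + 0.25 * (vm1 - vp1) ** 2,
        13.0 / 12.0 * (v0 - 2.0 * vp1 + vp2) ** 2
        + 0.25 * (3.0 * v0 - 4.0 * vp1 + vp2) ** 2,
    ])

    # Linear (optimal) weights, then nonlinear weights normalized to sum to 1.
    d = np.array([0.1, 0.6, 0.3])
    alpha = d / (eps + beta) ** 2
    w = alpha / alpha.sum()

    # Convex combination of the candidates.
    return float(w @ p)


if __name__ == "__main__":
    print(weno5_left([0.0, 1.0, 2.0, 3.0, 4.0]))  # smooth (linear) data
    print(weno5_left([0.0, 0.0, 0.0, 1.0, 1.0]))  # data with a jump
```

On smooth data the nonlinear weights stay close to the linear ones, while near a jump the weight shifts almost entirely to the sub-stencil that does not cross it. Each interface value depends only on a small fixed neighborhood, which is the stencil structure that makes kernels of this type candidates for many-core parallelization and vectorization.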

About this article

Publication history

Received: 17 April 2018

Revised: 07 July 2018

Accepted: 09 July 2018

Published: 22 July 2019

Issue date: February 2020

Acknowledgements

This paper was partially supported by the National High-Tech Research and Development (863) Program of China (No. 2015AA015306), the Science and Technology Plan of Beijing Municipality (No. Z161100000216147), the National Natural Science Foundation of China (No. 61762074), the Youth Foundation Program of Qinghai University (No. 2016-QGY-5), the National Natural Science Foundation of Qinghai Province (No. 2019-ZJ-7034), and the National Supercomputer Center in Wuxi, China.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).