Volume 26, Issue 3




PsmArena: Partitioned Shared Memory for NUMA-Awareness in Multithreaded Scientific Applications

Zhang Yang, Aiqing Zhang, Zeyao Mo
Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, Beijing 100088, China.

Abstract

The Distributed Shared Memory (DSM) architecture is widely used in modern computer design to mitigate the ever-widening processor-memory gap, and it inevitably exposes Non-Uniform Memory Access (NUMA) to shared-memory parallel applications. Failure to adapt to the NUMA effect can significantly degrade application performance, especially on today's manycore platforms with tens to hundreds of cores. Traditional approaches such as first-touch and memory policies, however, suffer from false page-sharing, fragmentation, or poor ease of use. In this paper, we propose a partitioned shared-memory approach that allows multithreaded applications to achieve full NUMA-awareness with only minor code changes, and we develop an accompanying NUMA-aware heap manager that eliminates false page-sharing and minimizes fragmentation. Experiments on a 256-core cc-NUMA computing node show that the proposed approach helps applications adapt to NUMA with only minor code changes and improves the performance of typical multithreaded scientific applications by up to 4.3 times as more cores are used.

Keywords: partitioned shared memory, Non-Uniform Memory Access (NUMA), heap manager, multithread, manycore


Publication history

Received: 21 July 2019
Accepted: 29 July 2019
Published: 12 October 2020
Issue date: June 2021

Copyright

© The author(s) 2021.

Acknowledgements

The authors would like to thank Dr. Linping Wu of the High Performance Computing Center, Institute of Applied Physics and Computational Mathematics, for his help in understanding OS interference on cc-NUMA systems. Dr. Xu Liu and Dr. Xiaowen Xu contributed several key ideas to the refinement of this paper. This work was supported by the National Key Research and Development Program of China (No. 2016YFB0201300). The authors thank the reviewers for their helpful comments.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
