Volume 26, Issue 3




PsmArena: Partitioned Shared Memory for NUMA-Awareness in Multithreaded Scientific Applications

Zhang Yang, Aiqing Zhang, Zeyao Mo
Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, Beijing 100088, China.

Abstract

The Distributed Shared Memory (DSM) architecture is widely used in modern computer design to mitigate the ever-widening processor-memory gap, and it inevitably exposes Non-Uniform Memory Access (NUMA) to shared-memory parallel applications. Failure to adapt to the NUMA effect can significantly degrade application performance, especially on today's manycore platforms with tens to hundreds of cores. Traditional approaches such as first-touch and memory policies, however, suffer from false page-sharing, fragmentation, or poor ease of use. In this paper, we propose a partitioned shared-memory approach that allows multithreaded applications to achieve full NUMA-awareness with only minor code changes, and we develop an accompanying NUMA-aware heap manager that eliminates false page-sharing and minimizes fragmentation. Experiments on a 256-core cc-NUMA computing node show that the proposed approach helps applications adapt to NUMA with only minor code changes and improves the performance of typical multithreaded scientific applications by up to 4.3 times as more cores are used.

Keywords: partitioned shared memory, Non-Uniform Memory Access (NUMA), heap manager, multithread, manycore


Publication history

Received: 21 July 2019
Accepted: 29 July 2019
Published: 12 October 2020
Issue date: June 2021

Copyright

© The author(s) 2021.

Acknowledgements

The authors would like to thank Dr. Linping Wu of the High Performance Computing Center, Institute of Applied Physics and Computational Mathematics, for his help in understanding OS interference on cc-NUMA systems. Dr. Xu Liu and Dr. Xiaowen Xu contributed several key ideas to the refinement of this paper. This work was supported by the National Key Research and Development Program of China (No. 2016YFB0201300). The authors thank the reviewers for their helpful comments.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
