Open Access

PsmArena: Partitioned Shared Memory for NUMA-Awareness in Multithreaded Scientific Applications

Zhang Yang, Aiqing Zhang, and Zeyao Mo
Laboratory of Computational Physics, Institute of Applied Physics and Computational Mathematics, Beijing 100088, China.

Abstract

The Distributed Shared Memory (DSM) architecture is widely used in today's computer design to mitigate the ever-widening processing-memory gap, and it inevitably exhibits Non-Uniform Memory Access (NUMA) to shared-memory parallel applications. Failure to adapt to the NUMA effect can significantly degrade application performance, especially on today's manycore platforms with tens to hundreds of cores. However, traditional approaches such as first-touch and memory policies fall short in terms of false page-sharing, fragmentation, or ease of use. In this paper, we propose a partitioned shared-memory approach that allows multithreaded applications to achieve full NUMA-awareness with only minor code changes, and we develop an accompanying NUMA-aware heap manager that eliminates false page-sharing and minimizes fragmentation. Experiments on a 256-core cc-NUMA computing node show that the proposed approach helps applications adapt to NUMA with only minor code changes and improves the performance of typical multithreaded scientific applications by up to 4.3-fold as more cores are used.
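
To illustrate the general idea of partitioning shared memory by NUMA node, the following C sketch allocates each thread's working buffer from the memory node local to the CPU running that thread, so threads on different nodes never share pages. This is a minimal sketch built on libnuma and OpenMP, not the authors' PsmArena API; the helper names psm_alloc and psm_free are hypothetical.

/* Minimal sketch of per-node (partitioned) allocation with libnuma.
 * Illustration of the general technique only, not the PsmArena API;
 * psm_alloc()/psm_free() are hypothetical helper names.
 * Build with: gcc -D_GNU_SOURCE sketch.c -fopenmp -lnuma
 */
#include <numa.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate 'size' bytes on the NUMA node local to the calling thread. */
static void *psm_alloc(size_t size)
{
    int node = numa_node_of_cpu(sched_getcpu());
    return numa_alloc_onnode(size, node);   /* memory placed on 'node' */
}

static void psm_free(void *p, size_t size)
{
    numa_free(p, size);
}

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }
    #pragma omp parallel
    {
        /* Each thread gets its own node-local buffer, so no page is
         * shared across NUMA nodes (avoids false page-sharing). */
        size_t n = 1 << 20;
        double *buf = psm_alloc(n * sizeof(double));
        for (size_t i = 0; i < n; i++)
            buf[i] = 0.0;                   /* touched locally */
        psm_free(buf, n * sizeof(double));
    }
    return 0;
}

In contrast to relying on the first-touch policy, this style of explicit per-node allocation keeps each buffer's pages on one node regardless of which thread first writes them, which is the property a partitioned, NUMA-aware heap manager generalizes to arbitrary allocations.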

Tsinghua Science and Technology
Pages 287-295
Cite this article:
Yang Z, Zhang A, Mo Z. PsmArena: Partitioned Shared Memory for NUMA-Awareness in Multithreaded Scientific Applications. Tsinghua Science and Technology, 2021, 26(3): 287-295. https://doi.org/10.26599/TST.2019.9010036


Received: 21 July 2019
Accepted: 29 July 2019
Published: 12 October 2020
© The author(s) 2021.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
