RBC: A Memory Architecture for Improved Performance and Energy Efficiency

Wenjie Liu; Ke Zhou; Ping Huang; Tianming Yang; Xubin He

doi:10.26599/TST.2019.9010077

Tsinghua Science and Technology 2021, 26(3): 347-360 https://doi.org/10.26599/TST.2019.9010077

Open Access | Issue | Published: 12 October 2020

Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA.

Wuhan National Laboratory of Optoelectronics (WNLO), Huazhong University of Science and Technology, Wuhan 430074, China.

Huanghuai University, Zhumadian 463000, China.

Keywords:

memory system, Dynamic Random Access Memory (DRAM), row buffer conflict

Cite this article:

Liu W, Zhou K, Huang P, et al. RBC: A Memory Architecture for Improved Performance and Energy Efficiency. Tsinghua Science and Technology, 2021, 26(3): 347-360. https://doi.org/10.26599/TST.2019.9010077

Abstract

DRAM-based memory suffers from an increasing number of row buffer conflicts, which cause significant performance degradation and extra power consumption. As memory capacity grows, the overhead of each row buffer conflict worsens, because longer bitlines lead to higher row activation and precharge latencies. In this work, we propose a practical approach called Row Buffer Cache (RBC) to mitigate row buffer conflict overheads efficiently. At the core of the proposed RBC architecture, rows with good spatial locality are cached and protected, so that they are not interrupted by accesses to rows with poor locality. The RBC architecture thereby significantly reduces the performance and energy overheads caused by row activation and precharge, and thus improves overall system performance and energy efficiency. We evaluate the RBC architecture using SPEC CPU2006 on a DDR4 memory and compare it with a commodity baseline memory system. Results show that, for single-core simulations, RBC improves overall performance by up to $2.24\times$ ($16.1\%$ on average) and reduces memory energy by up to $68.2\%$ ($23.6\%$ on average). For multi-core simulations, RBC increases overall performance by up to $1.55\times$ ($17\%$ on average) and reduces memory energy consumption by up to $35.4\%$ ($21.3\%$ on average).
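
To make the idea of protecting high-locality rows concrete, the following is a toy sketch rather than the authors' design. It models a single bank under an open-page policy and counts row activations, with and without a small row cache. The abstract does not say how locality is detected or how cached rows are managed, so the promotion rule (`promote_after` consecutive row buffer hits), the LRU eviction, and all names below are assumptions made for illustration only.

```python
from collections import OrderedDict


def count_activations(trace, cache_rows=0, promote_after=2):
    """Count row activations for one bank under an open-page policy.

    trace         -- iterable of row numbers in access order
    cache_rows    -- capacity of the hypothetical row cache (0 = baseline)
    promote_after -- consecutive row buffer hits before a row is cached
                     (an assumed policy, not taken from the paper)
    """
    open_row = None          # row currently held in the row buffer
    streak = 0               # consecutive hits on the open row
    cache = OrderedDict()    # cached rows, least recently used first
    activations = 0

    for row in trace:
        if row in cache:
            # Served from the row cache; the open row is left undisturbed.
            cache.move_to_end(row)
            continue
        if row == open_row:
            # Row buffer hit: no activation needed.
            streak += 1
            if cache_rows and streak >= promote_after:
                cache[row] = True
                if len(cache) > cache_rows:
                    cache.popitem(last=False)  # evict the LRU row
            continue
        # Row buffer miss or conflict: precharge (if needed) and activate.
        activations += 1
        open_row = row
        streak = 0
    return activations


# Row 1 has good locality; rows 9, 8, and 7 are touched once each.
trace = [1, 1, 1, 9, 1, 8, 1, 7, 1]
baseline = count_activations(trace)
with_cache = count_activations(trace, cache_rows=1)
print(baseline, with_cache)
```

In this trace, every one-off access to rows 9, 8, and 7 forces the baseline to reopen row 1 afterwards, whereas the cached variant keeps serving row 1 without a new activation. The sketch counts activations only; it does not model timing, energy, or multiple banks.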

Publication history

Received: 09 December 2019

Accepted: 19 December 2019

Published: 12 October 2020

Issue date: June 2021

Acknowledgements

This work was supported by the US National Science Foundation (Nos. CCF-1717660 and CNS-1828363).

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).