Survey

Reinvent Cloud Software Stacks for Resource Disaggregation

Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
University of the Chinese Academy of Sciences, Beijing 101408, China
Huawei Cloud, Shenzhen 518129, China

Abstract

Owing to the unprecedented development of low-latency interconnect technology, building large-scale disaggregated architectures is drawing increasing attention from both industry and academia. Resource disaggregation is a new way to organize the hardware resources of datacenters, and it has the potential to overcome the limitations of conventional datacenters, e.g., low resource utilization and low reliability. However, the emerging disaggregated architecture brings severe performance and latency problems to existing cloud systems. In this paper, we take memory disaggregation as an example to demonstrate the unique challenges that the disaggregated datacenter poses to existing cloud software stacks, e.g., the programming interface, language runtime, and operating system, and we further discuss possible ways to reinvent cloud systems.

Electronic Supplementary Material

JCST-2304-13272-Highlights.pdf (521.6 KB)

Journal of Computer Science and Technology
Pages 949-969
Cite this article:
Wang C-X, Shan Y-Z, Zuo P-F, et al. Reinvent Cloud Software Stacks for Resource Disaggregation. Journal of Computer Science and Technology, 2023, 38(5): 949-969. https://doi.org/10.1007/s11390-023-3272-0

Received: 03 April 2023
Accepted: 01 September 2023
Published: 30 September 2023
© Institute of Computing Technology, Chinese Academy of Sciences 2023