Survey

xCCL: A Survey of Industry-Led Collective Communication Libraries for Deep Learning

Department of Computer Science and Engineering, University of California, Merced, Merced 95343, U.S.A.

Abstract

Machine learning techniques have become ubiquitous in both industry and academic applications. Increasing model sizes and training data volumes necessitate fast and efficient distributed training. Collective communications greatly simplify inter- and intra-node data transfer and are an essential part of distributed training, as information such as gradients must be shared between processing nodes. In this paper, we survey the current state-of-the-art collective communication libraries (collectively, xCCL: NCCL, oneCCL, RCCL, MSCCL, ACCL, and Gloo), with a focus on the industry-led ones for deep learning workloads. We investigate the design features of these xCCLs, discuss their use cases in industry deep learning workloads, compare their performance using industry benchmarks (i.e., NCCL Tests and PARAM), and present key takeaways and interesting observations. We believe our survey sheds light on potential research directions for future xCCL designs.
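The central primitive the abstract refers to is allreduce: every rank ends up with the element-wise sum (or average) of all ranks' gradient buffers. The following is a toy, single-process simulation of the ring allreduce schedule that libraries such as NCCL popularized; all names and the data layout here are illustrative assumptions, not any xCCL's actual API.

```python
# Toy single-process simulation of ring allreduce: reduce-scatter
# followed by allgather. Illustrative only, not a real xCCL API.

def ring_allreduce(buffers):
    """Sum-allreduce equal-length vectors held by len(buffers) 'ranks'."""
    n = len(buffers)                       # number of simulated ranks
    size = len(buffers[0])
    assert size % n == 0, "vector must split into n equal chunks"
    chunk = size // n

    def seg(c):                            # index range of chunk c
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. In each of n-1 steps every rank forwards
    # one partially reduced chunk to its ring neighbor, which adds it in.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, buffers[r][seg((r - step) % n)])
                 for r in range(n)]        # snapshot, as if all send at once
        for r, c, data in sends:
            dst = (r + 1) % n
            s = seg(c)
            buffers[dst][s] = [a + b for a, b in zip(buffers[dst][s], data)]

    # After phase 1, rank r holds the fully reduced chunk (r + 1) % n.
    # Phase 2: allgather. Reduced chunks circulate the ring; receivers
    # overwrite rather than accumulate.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, buffers[r][seg((r + 1 - step) % n)])
                 for r in range(n)]
        for r, c, data in sends:
            buffers[(r + 1) % n][seg(c)] = data

    return buffers
```

Each rank moves only size/n elements per step over 2(n-1) steps, which is why the ring schedule is bandwidth-optimal for large messages; the surveyed libraries switch among schedules like this (ring, tree, etc.) based on message size and topology.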

Electronic Supplementary Material

JCST-2210-12894-Highlights.pdf (612.5 KB)

Journal of Computer Science and Technology
Pages 166-195
Cite this article:
Weingram A, Li Y, Qi H, et al. xCCL: A survey of industry-led collective communication libraries for deep learning. Journal of Computer Science and Technology, 2023, 38(1): 166-195. https://doi.org/10.1007/s11390-023-2894-6


Received: 08 October 2022
Revised: 09 November 2022
Accepted: 03 January 2023
Published: 28 February 2023
© Institute of Computing Technology, Chinese Academy of Sciences 2023