- AI networks play an necessary function in interconnecting tens of 1000’s of GPUs collectively, forming the foundational infrastructure for coaching, enabling massive fashions with tons of of billions of parameters akin to LLAMA 3.1 405B.
- This week at ACM SIGCOMM 2024 in Sydney, Australia, we’re sharing particulars on the community we’ve got constructed at Meta over the previous few years to assist our large-scale distributed AI coaching workload.
- Our paper, “RDMA over Ethernet for Distributed AI Training at Meta Scale,” gives the main points on how we design, implement, and function one of many world’s largest AI networks at scale.
The rising prevalence of AI has launched a brand new period of communication calls for. Distributed coaching, specifically, imposes essentially the most important pressure on information middle networking infrastructure. As an illustration, a typical generative AI (GenAI) job could necessitate tight coordination of tens of thousands of GPUs over the course of a number of weeks. Setting up a dependable, high-performance community infrastructure able to accommodating this burgeoning demand necessitates a reevaluation of knowledge middle community design.
When Meta launched distributed GPU-based training, we determined to assemble specialised information middle networks tailor-made for these GPU clusters. We opted for RDMA Over Converged Ethernet model 2 (RoCEv2) because the inter-node communication transport for almost all of our AI capability.
We have now efficiently expanded our RoCE networks, evolving from prototypes to the deployment of quite a few clusters, every accommodating 1000’s of GPUs. These RoCE clusters assist an intensive vary of manufacturing distributed GPU coaching jobs, together with rating, content material suggestion, content material understanding, pure language processing, and GenAI mannequin coaching, amongst different workloads.
Topology
We constructed a devoted backend community particularly for distributed coaching. This allowed us to evolve, function, and scale independently from the remainder of the info middle community. To assist massive language fashions (LLMs), we expanded the backend community in the direction of the DC-scale, e.g., incorporating topology-awareness into the coaching job scheduler.
The separation
The coaching cluster depends on two unbiased networks: the frontend (FE) community for duties akin to information ingestion, checkpointing, and logging, and the backend (BE) community for coaching, as depicted beneath.
A coaching rack is related to each the FE and BE of the info middle community. The FE has a hierarchy of community layers – rack switches (RSWs), material switches (FSWs), and better – that homes the storage warehouse, which gives GPUs with the mandatory enter information for coaching workloads. We guarantee that there’s sufficient ingress bandwidth on the rack swap to not hinder the coaching workload.
The BE is a specialised material that connects all RDMA NICs in a non-blocking structure, offering excessive bandwidth, low latency, and lossless transport between any two GPUs within the cluster, no matter their bodily location. This backend material makes use of the RoCEv2 protocol, which encapsulates the RDMA service in UDP packets for transport over the community.
AI Zone
Our BE networks have undergone a number of transformations. Initially, our GPU clusters used a easy star topology with a couple of AI racks related to a central Ethernet swap operating the non-routable RoCEv1 protocol. This setup had clear limitations in GPU scale and swap redundancy. Due to this fact, we swiftly transitioned to a fabric-based structure for prolonged scalability and better availability.
We designed a two-stage Clos topology for AI racks, referred to as an AI Zone. The rack coaching swap (RTSW), serving because the leaf swap, affords scale-up connectivity for GPUs throughout the rack utilizing copper-based DAC cables. The backbone tier, composed of modular cluster coaching switches (CTSW), gives scale-out connectivity amongst all racks within the cluster. The CTSW has deep buffers statically divided over the ports within the chassis. The RTSWs connect with CTSWs through single-mode fiber and 400G pluggable transceivers.
The AI Zones are designed to assist numerous interconnected GPUs in a non-blocking method. Nevertheless, rising AI developments, akin to LLMs like Llama, demand a GPU scale bigger than what a single AI zone gives. To accommodate this, we designed an aggregator coaching swap (ATSW) layer that connects the CTSWs in a knowledge middle constructing, increasing the RoCE area past a single AI Zone.
Word, the cross-AI Zone connectivity is oversubscribed by design, with community visitors balanced utilizing ECMP. To mitigate the efficiency bottleneck for cross-AI Zone visitors, we enhanced the coaching job scheduler to discover a “minimal lower” when dividing the coaching nodes into totally different AI Zones, decreasing the cross-AI Zone visitors and thus collective completion time. The scheduler does this by studying the place of GPU servers within the logical topology to advocate a rank task.
Routing
The scaling of compute energy and community topology mentioned above led to the query of effectively stability and route the huge coaching visitors. Particularly, the AI coaching workloads had a number of difficult traits:
- Low entropy: In comparison with conventional information middle workloads, the quantity and the variety of flows for AI workloads are a lot smaller and the stream patterns are normally repetitive and predictable.
- Burstiness: On the time dimension, the flows normally exhibit the “on and of”’ nature within the time granularity of milliseconds.
- Elephant flows: For every burst, the depth of every stream might attain as much as the road charge of NICs.
ECMP and path pinning
We initially thought-about the extensively adopted ECMP, which locations flows randomly primarily based on the hashes on the five-tuple: supply and vacation spot IPs, supply and vacation spot UDP ports, and protocol. Nevertheless, and as anticipated, ECMP rendered poor efficiency for the coaching workload as a result of low stream entropy.
Alternatively, we designed and deployed a path-pinning scheme within the preliminary years of our deployment. This scheme routed packets to particular paths primarily based on the vacation spot “slice” (the index of the RTSW downlink). This labored effectively if every rack was absolutely assigned to the identical job and there was no failure within the community. Nevertheless, this was seldom true. We noticed that the rack might be partially allotted to a job, with solely one of many two hosts within the rack utilizing the uplink bandwidth. This fragmented job placement precipitated uneven visitors distribution and congestion on the uplinks of the actual RTSW and degraded the coaching efficiency as much as greater than 30%. Additional, community failures on a uplink or a CTSW precipitated the affected flows to be inconsistently reassigned to different CTSWs by ECMP. These reassigned flows collided with different present flows and slowed down the entire coaching job.
We mitigated the fast affect of those stream collisions by upgrading the bandwidth of the RTSW uplinks bandwidth by 2x. Therefore we allowed for the RTSW uplink capability to be 1:2 under-subscribed in comparison with the RTSW downlink capability. Whereas this mitigated the fast efficiency affect, this was an costly resolution because it required 2x community capability. Thus, we acknowledged this as a short-term mitigation and proceeded to additional levels of routing evolution.
Queue pair scaling
We subsequent revisited ECMP with an intent to extend the variety of flows for hierarchical collectives via the queue pair (QP) scaling software program function within the collective library.
To account for this, we configured switches to carry out Enhanced ECMP (E-ECMP) to moreover hash on the vacation spot QP area of a RoCE packet utilizing the UDF functionality of the swap ASIC. This elevated entropy and, in comparison with baseline ECMP with out QP scaling, we noticed that E-ECMP together with QP scaling confirmed efficiency enchancment of as much as 40% for the AllReduce collective.
We evaluated two QP scaling methods. The primary concerned splitting every message meant to be posted over a single QP, as an alternative onto a number of QPs leading to a number of flows. However it additionally produced smaller message sizes on material in addition to a number of ACKs. The second strategy concerned posting every message to a special queue, in a round-robin trend. For the NIC message sizes demonstrated in our manufacturing with NCCL, we noticed the latter to be performing effectively. This function has been necessary for ECMP scalability by rising the community flows for hierarchical collectives like AllReduce.
Whereas we improved ECMP efficiency with QP scaling, the underlying probabilistic nature of hashing was a persistent draw back of this routing scheme. Additionally, the necessity to customise the QP scaling issue and methodology primarily based on the workload kind, whereas workable within the short-term, offered long-term operational complexity.
Congestion management
As we transitioned to 400G deployments, we tried to tune DCQCN to adapt to new community speeds and topology. Nevertheless, with default DCQCN settings and doubled ECN thresholds in comparison with 200G networks, efficiency was degraded. Additional investigation revealed that DCQCN implementation in firmware has modified, introducing bugs and decreased visibility with issues regarding appropriate CNP counting.
We proceeded with out DCQCN for our 400G deployments. At the moment, we’ve got had over a 12 months of expertise with simply PFC for stream management, with out another transport-level congestion management. We have now noticed steady efficiency and lack of persistent congestion for coaching collectives.
Receiver-driven visitors admission
To mitigate the congestion for 400G and past, we co-designed the collective library and RoCE transport to implement receiver-driven visitors admission for higher efficiency. The diagram beneath exhibits that the GPU-to-GPU communication structure in our manufacturing coaching clusters predominantly makes use of two-stage copy and receiver-initiated communication through the NCCL collective library. Every GPU’s excessive bandwidth reminiscence (HBM) maintains a number of channels for parallel transmission of chunked collective messages. The sender GPU threads first copy information from the compute buffer to an obtainable channel buffer. The sender CPU proxy thread can solely put up an RDMA write request after receiving a clear-to-send (CTS) packet from the receiver, which incorporates the scale and reminiscence data. The receiver’s GPU threads then copy the channel buffer contents to the vacation spot compute buffer. Lastly, CPU proxy threads on each side recycle the channel buffer, and the receiver CPU proxy sends one other CTS packet as soon as the channel buffer is prepared.
We successfully leverage this mechanism as a receiver-driven visitors admission to restrict the quantity of in-flight visitors on the community, particularly when congestion begins to construct up. Nevertheless, configuring the best setting might be difficult as:
- The variety of channels is restricted as a result of useful resource rivalry on GPU threads with concurrent compute operations;
- Setting the channel buffer dimension requires a extra cautious stability between congestion spreading and bandwidth under-utilization than Infiniband on account of RoCE’s extra coarse-grained stream management and attainable end-host slowness.
Thus, we took two steps to enhance the efficiency. First, we experimentally decided the best parameter settings for the variety of channels and channel buffer dimension throughout varied coaching job sizes and collective sorts. Second, we applied excessive precedence queuing at switches for CTS packets to expedite the notifications and mitigate potential bandwidth hunger.
Congestion management has been a focus of analysis in RDMA networks. DCQCN has been the gold customary for storage-focused networks. Nevertheless, our expertise with distributed AI coaching workloads gives a special perspective on tailoring the congestion management algorithms. Regardless of turning off DCQCN and a number of situations of RTSW sending PFC to a deep-buffer CTSW, we’ve got not encountered a situation over the past 4 years the place manufacturing AI coaching visitors causes the CTSW to ship PFCs to RTSWs persistently.
Our present resolution depends upon cautious coordination between the collective communication library and the community. It could rely upon the relative throughput between GPU and community, which will not be relevant to all eventualities. We encourage the analysis neighborhood to place extra deal with this matter.
Shifting ahead
The design and operation of large-scale RoCE networks for distributed AI coaching workloads have developed to fulfill the rising calls for of computational density and scale. By segregating FE and BE networks, using varied routing schemes, and optimizing collective visitors patterns, we’ve got been in a position to construct a performant and dependable community infrastructure. These designs and insights underline the significance of deeply understanding the coaching workload and translating these implications into community part design, finally contributing to the development of distributed AI coaching infrastructure.
With the quick rising pattern of GenAI workload, our community infrastructure will evolve quickly.
Learn the paper
RDMA over Ethernet for Distributed AI Training at Meta Scale
Acknowledgements
We wish to thank all contributors to the paper, together with Rui Miao, Shengbao Zheng, Sai Jayesh Bondu, Guilherme Goes, Hany Morsy, Rohit Puri, Adi Mohammad Riftadi, Ashmitha Jeevaraj Shetty, Jingyi Yang, Shuqiang Zhang, Mikel Jimenez Fernandez, Shashi Gandham, Omar Baldonado. Many present and former individuals within the Community Infrastructure group at Meta have contributed to productionizing RoCE networks for AI coaching through the years. Specifically, we wish to acknowledge Srinivas Sridharan, Petr Lapukhov, Jose Leitao, and Brandon Taylor. This work is a detailed collaboration with our companions in Meta’s AI Manufacturing Engineering, AI and Programs Co-design, and AI {Hardware} Programs groups.