- At the Open Compute Project (OCP) Global Summit 2024, we're showcasing our newest open AI hardware designs with the OCP community.
- These innovations include a new AI platform, cutting-edge open rack designs, and advanced network fabrics and components.
- By sharing our designs, we hope to encourage collaboration and foster innovation. If you're passionate about building the future of AI, we invite you to engage with us and OCP to help shape the next generation of open hardware for AI.
AI has been at the core of the experiences Meta has been delivering to people and businesses for years, including AI modeling innovations to optimize and improve features like Feed and our ads system. As we develop and release new, advanced AI models, we're also driven to advance our infrastructure to support our new and emerging AI workloads.
For example, Llama 3.1 405B, Meta's largest model, is a dense transformer with 405B parameters and a context window of up to 128k tokens. To train a large language model (LLM) of this magnitude, with over 15 trillion tokens, we had to make substantial optimizations to our entire training stack. This effort pushed our infrastructure to operate across more than 16,000 NVIDIA H100 GPUs, making Llama 3.1 405B the first model in the Llama series to be trained at such a massive scale.
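For a rough sense of that scale, the commonly used approximation of about 6 × parameters × tokens puts this training run in the range of 10^25 floating-point operations. The sketch below works through that arithmetic; the per-GPU throughput and utilization figures are illustrative assumptions, not our reported numbers.

```python
# Back-of-envelope training compute for Llama 3.1 405B using the common
# ~6 * parameters * tokens FLOPs approximation. Throughput and
# utilization below are illustrative assumptions, not reported figures.

params = 405e9               # model parameters (from the post)
tokens = 15e12               # training tokens (from the post)
gpus = 16_000                # NVIDIA H100 GPUs (from the post)

total_flops = 6 * params * tokens            # ~3.6e25 FLOPs

peak_flops_per_gpu = 989e12  # approximate H100 BF16 dense peak
utilization = 0.40           # assumed sustained model FLOPs utilization

sustained_flops = gpus * peak_flops_per_gpu * utilization
days = total_flops / sustained_flops / 86_400
print(f"~{total_flops:.2e} FLOPs, on the order of {days:.0f} days at this scale")
```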
Prior to Llama, our largest AI jobs ran on 128 NVIDIA A100 GPUs. But things have rapidly accelerated. Over the course of 2023, we quickly scaled up our training clusters from 1K, 2K, 4K, to eventually 16K GPUs to support our AI workloads. Today, we're training our models on two 24K-GPU clusters.
We don't expect this upward trajectory for AI clusters to slow down any time soon. In fact, we expect the amount of compute needed for AI training will grow significantly from where we are today.
Building AI clusters requires more than just GPUs. Networking and bandwidth play an important role in ensuring the clusters' performance. Our systems consist of a tightly integrated HPC compute system and an isolated high-bandwidth compute network that connects all our GPUs and domain-specific accelerators. This design is necessary to meet our injection needs and address the challenges posed by our need for bisection bandwidth.
In the next few years, we anticipate greater injection bandwidth on the order of a terabyte per second, per accelerator, with equal normalized bisection bandwidth. This represents growth of more than an order of magnitude compared to today's networks!
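To put numbers on what equal normalized bisection bandwidth means at that injection rate, here is a minimal sketch that uses the 24K-GPU cluster size mentioned above purely as an illustrative figure.

```python
# Minimal sketch: aggregate bisection bandwidth implied by ~1 TB/s of
# injection per accelerator with 1:1 (fully non-blocking) normalization.
# The cluster size is illustrative, borrowed from the 24K-GPU figure above.

accelerators = 24_000        # illustrative cluster size
injection_tb_per_s = 1.0     # ~1 TB/s injection per accelerator

# With equal normalized bisection bandwidth, traffic crossing any
# half/half cut of the cluster is limited only by the injection rate
# of one half of the accelerators.
bisection_tb_per_s = (accelerators / 2) * injection_tb_per_s
print(f"Full bisection bandwidth: {bisection_tb_per_s:,.0f} TB/s "
      f"(~{bisection_tb_per_s / 1_000:.0f} PB/s)")
```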
To support this growth, we need a high-performance, multi-tier, non-blocking network fabric that can utilize modern congestion control to behave predictably under heavy load. This will allow us to fully leverage the power of our AI clusters and ensure they continue to perform optimally as we push the boundaries of what's possible with AI.
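The sketch below shows why the fabric has to be multi-tier at this scale: with a hypothetical switch radix (not an actual Meta switch specification), a two-tier non-blocking Clos tops out well below the cluster sizes we are already running, so a third tier is required.

```python
# Minimal sketch of non-blocking Clos scaling limits. The switch radix
# is hypothetical, not an actual Meta switch specification.

radix = 128  # ports per switch (hypothetical)

# Two tiers: each leaf splits ports half down (endpoints) and half up
# (spines); each spine must reach every leaf, capping the leaf count.
max_two_tier = (radix // 2) * radix            # 8,192 endpoints

# Adding a third tier multiplies reach by another radix/2 factor.
max_three_tier = (radix // 2) ** 2 * radix     # 524,288 endpoints

print(f"2-tier non-blocking ceiling: {max_two_tier:,} endpoints")
print(f"3-tier non-blocking ceiling: {max_three_tier:,} endpoints")
# A 24K+ GPU cluster already exceeds the 2-tier ceiling at this radix,
# which is why multi-tier fabrics with predictable congestion control
# under heavy load matter.
```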
Scaling AI at this speed requires open hardware solutions. Developing new architectures, network fabrics, and system designs is most efficient and impactful when we can build on principles of openness. By investing in open hardware, we unlock AI's full potential and propel ongoing innovation in the field.
Introducing Catalina: Open Architecture for AI Infra
Today, we announced the upcoming release of Catalina, our new high-powered rack designed for AI workloads, to the OCP community. Catalina is based on the NVIDIA Blackwell platform full rack-scale solution, with a focus on modularity and flexibility. It's built to support the latest NVIDIA GB200 Grace Blackwell Superchip, ensuring it meets the growing demands of modern AI infrastructure.
The growing power demands of GPUs mean open rack solutions need to support higher power capability. With Catalina, we're introducing the Orv3, a high-power rack (HPR) capable of supporting up to 140kW.
The full solution is liquid cooled and consists of a power shelf that supports a compute tray, switch tray, the Orv3 HPR, the Wedge 400 fabric switch, a management switch, a battery backup unit, and a rack management controller.
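As a purely illustrative power-budget check against the 140kW figure above, the sketch below assumes hypothetical per-component draws; these are placeholders for illustration, not Catalina specifications.

```python
# Illustrative power-budget check for a 140 kW Orv3 high-power rack.
# All per-component draws are hypothetical placeholders, not published
# Catalina specifications.

rack_budget_kw = 140.0           # Orv3 HPR capability (from the post)

compute_trays = 8                # assumed number of compute trays
per_compute_tray_kw = 15.0       # assumed draw per liquid-cooled tray
switch_and_mgmt_kw = 5.0         # fabric switch, mgmt switch, RMC (assumed)
cooling_overhead_kw = 6.0        # assumed share for liquid-cooling support

total_kw = (compute_trays * per_compute_tray_kw
            + switch_and_mgmt_kw + cooling_overhead_kw)
headroom_kw = rack_budget_kw - total_kw
print(f"Estimated load: {total_kw:.0f} kW, headroom: {headroom_kw:.0f} kW")
```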
We aim for Catalina's modular design to empower others to customize the rack to meet their specific AI workloads while leveraging both existing and emerging industry standards.
The Grand Teton Platform now supports AMD accelerators
In 2022, we announced Grand Teton, our next-generation AI platform (the follow-up to our Zion-EX platform). Grand Teton is designed with compute capacity to support the demands of memory-bandwidth-bound workloads, such as Meta's deep learning recommendation models (DLRMs), as well as compute-bound workloads like content understanding.
Now, we have expanded the Grand Teton platform to support the AMD Instinct MI300X and will be contributing this new version to OCP. Like its predecessors, this new version of Grand Teton features a single monolithic system design with fully integrated power, control, compute, and fabric interfaces. This high level of integration simplifies system deployment, enabling rapid scaling with increased reliability for large-scale AI inference workloads.
In addition to supporting a range of accelerator designs, now including the AMD Instinct MI300X, Grand Teton offers significantly greater compute capacity, allowing faster convergence on a larger set of weights. This is complemented by expanded memory to store and run larger models locally, along with increased network bandwidth to scale up training cluster sizes efficiently.
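As a rough illustration of what that expanded memory enables, the sketch below checks whether a 405B-parameter model's weights fit in local accelerator memory. The accelerator count per system and the overhead margin are assumptions for illustration; 192 GB of HBM per MI300X is AMD's published capacity.

```python
# Rough sizing sketch: do the weights of a very large model fit in the
# local HBM of a single accelerator system? Accelerator count and the
# overhead margin are assumptions; 192 GB per MI300X is AMD's spec.

accelerators = 8                # assumed accelerators per system
hbm_per_accel_gb = 192          # AMD Instinct MI300X HBM capacity
params = 405e9                  # e.g., Llama 3.1 405B
bytes_per_param = 2             # BF16/FP16 weights

weights_gb = params * bytes_per_param / 1e9
total_hbm_gb = accelerators * hbm_per_accel_gb
overhead = 1.2                  # assumed margin for activations/KV cache

fits = weights_gb * overhead <= total_hbm_gb
print(f"Weights ~{weights_gb:.0f} GB, local HBM {total_hbm_gb} GB, fits: {fits}")
```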
Open Disaggregated Scheduled Fabric
Developing an open, vendor-agnostic networking backend is going to play an important role going forward as we continue to push the performance of our AI training clusters. Disaggregating our network allows us to work with vendors from across the industry to design systems that are innovative as well as scalable, flexible, and efficient.
Our new Disaggregated Scheduled Fabric (DSF) for our next-generation AI clusters offers several advantages over our existing switches. By opening up our network fabric we can overcome limitations in scale, component supply options, and power density. DSF is powered by the open OCP-SAI standard and FBOSS, Meta's own network operating system for controlling network switches. It also supports an open and standard Ethernet-based RoCE interface to endpoints and accelerators across multiple GPUs and NICs from several different vendors, including our partners at NVIDIA, Broadcom, and AMD.
In addition to DSF, we have also developed and built new 51T fabric switches based on Broadcom and Cisco ASICs. Finally, we're sharing our new FBNIC, a new NIC module that contains our first Meta-designed network ASIC.
Meta and Microsoft: Driving Open Innovation Together
Meta and Microsoft have a long-standing partnership within OCP, beginning with the development of the Switch Abstraction Interface (SAI) for data centers in 2018. Over the years, we've jointly contributed to key initiatives such as the Open Accelerator Module (OAM) standard and SSD standardization, showcasing our shared commitment to advancing open innovation.
Our current collaboration focuses on Mount Diablo, a new disaggregated power rack. It's a cutting-edge solution featuring a scalable 400 VDC unit that enhances efficiency and scalability. This innovative design allows more AI accelerators per IT rack, significantly advancing AI infrastructure. We're excited to continue our collaboration through this contribution.
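A quick worked example shows why the higher voltage matters: for a given power, busbar current falls in proportion to voltage, so 400 VDC can deliver far more accelerator power through practical conductors. The 48 V figure below reflects earlier Open Rack busbar designs and is used here only for comparison.

```python
# Why higher-voltage power delivery helps: P = V * I, so for the same
# power the required current drops as voltage rises. The 140 kW figure
# reuses the Orv3 HPR capability mentioned earlier; 48 V is the earlier
# Open Rack busbar voltage, shown for comparison.

power_kw = 140.0
for volts in (48, 400):
    amps = power_kw * 1_000 / volts
    print(f"{power_kw:.0f} kW at {volts} V -> {amps:,.0f} A")
# ~2,917 A at 48 V versus 350 A at 400 V for the same delivered power.
```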
The open future of AI infra
Meta is committed to open source AI. We believe that open source will put the benefits and opportunities of AI into the hands of people all over the world.
AI won't realize its full potential without collaboration. We need open software frameworks to drive model innovation, ensure portability, and promote transparency in AI development. We must also prioritize open and standardized models so we can leverage collective expertise, make AI more accessible, and work toward minimizing biases in our systems.
Just as important, we also need open AI hardware systems. These systems are necessary for delivering the kind of high-performance, cost-effective, and adaptable infrastructure that AI advancement requires.
We encourage anyone who wants to help advance the future of AI hardware systems to engage with the OCP community. By addressing AI's infrastructure needs together, we can unlock the true promise of open AI for everyone.