Training Trillion-Parameter Models on AMD Hardware: Overcoming Challenges

Training large-scale AI models efficiently across hundreds of nodes is a challenge, especially when most workloads are optimized for Nvidia hardware and CUDA. However, researchers at Oak Ridge National Laboratory (ORNL) have made significant strides in training trillion-parameter models using AMD’s MI250X GPUs.

The MI250X is a powerful compute engine for traditional high-performance computing (HPC) workloads, offering impressive double-precision floating point performance. While double precision is largely unnecessary for AI workloads, the MI250X still delivers a respectable 383 teraflops of peak performance when operating at FP16 precision. With 37,888 of these GPUs at their disposal in the Frontier supercomputer, the ORNL team set out to optimize the machine for training large AI models.
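For a rough sense of scale, the theoretical aggregate FP16 throughput follows directly from those two figures. The back-of-the-envelope calculation below uses vendor peak numbers, not sustained training rates:

```python
# Back-of-the-envelope peak FP16 throughput for Frontier's GPU partition.
# These are vendor peak figures, not measured training throughput.
PEAK_FP16_TFLOPS_PER_MI250X = 383   # per-GPU peak at FP16
NUM_GPUS = 37_888                   # MI250X GPUs in Frontier

aggregate_tflops = PEAK_FP16_TFLOPS_PER_MI250X * NUM_GPUS
print(f"Aggregate peak FP16: {aggregate_tflops / 1e6:.1f} exaflops")
# -> roughly 14.5 exaflops of theoretical peak FP16 compute
```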

ORNL faced multiple challenges in this endeavor. First, the team had to port its training stack to AMD’s ROCm runtime, which required collaboration with AMD developers. Converting CUDA code to run on other vendors’ hardware is not a simple task, but the tooling for it is steadily improving.
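In practice, much of the Python-level code in a training stack needs little change: ROCm builds of PyTorch route the familiar torch.cuda API through HIP. The sketch below is only a minimal backend check, not a representation of ORNL’s porting work, which also involved lower-level kernels and libraries:

```python
import torch

# On a ROCm build of PyTorch, torch.cuda.* calls are routed through HIP,
# so most CUDA-targeting Python code runs unchanged on AMD GPUs.
def describe_backend() -> str:
    if not torch.cuda.is_available():
        return "no GPU backend available"
    if torch.version.hip is not None:   # set only on ROCm builds
        return f"ROCm/HIP {torch.version.hip}: {torch.cuda.get_device_name(0)}"
    return f"CUDA {torch.version.cuda}: {torch.cuda.get_device_name(0)}"

print(describe_backend())
```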

Additionally, the researchers found that stretching tensor parallelism across nodes led to latency bottlenecks; the best results came from confining tensor parallelism to a single node of eight GPUs. The team also used the ZeRO-1 optimizer to reduce memory overheads and adopted Amazon Web Services’ plug-in for AMD’s ROCm Collective Communication Library (RCCL) to improve communication stability between nodes.
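A hedged sketch of what such a configuration might look like in a Megatron-DeepSpeed-style launch is shown below. The specific values and file names are illustrative assumptions, not ORNL’s published job script:

```python
# Illustrative sketch only: these values mirror the choices described above
# (tensor parallelism confined to one 8-GPU node, ZeRO stage 1 for optimizer
# state sharding); they are not ORNL's actual job configuration.
import json

deepspeed_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 1,  # ZeRO-1: shard only the optimizer states across data-parallel ranks
    },
}

megatron_args = [
    "--tensor-model-parallel-size", "8",     # keep tensor parallelism on-node
    "--pipeline-model-parallel-size", "64",  # spread layers across nodes (example value)
]

# The AWS plug-in for RCCL is typically enabled at the job level, e.g. by
# putting the plug-in library on LD_LIBRARY_PATH, rather than in this config.
with open("ds_config.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)
print(" ".join(megatron_args))
```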

When it came to efficiency, weak scaling, where the problem size grows in proportion to the number of GPUs, proved to be 100 percent efficient. Strong scaling, where more GPUs are thrown at a fixed problem size, showed diminishing returns due to communication and other bottlenecks.
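The distinction between the two metrics is easy to quantify. A minimal sketch, using made-up timings rather than numbers from the Frontier runs, might look like this:

```python
# Minimal sketch of the two scaling metrics; the timings below are invented
# for illustration, not measurements from the Frontier experiments.

def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Problem size grows with GPU count, so ideally runtime stays flat."""
    return t_base / t_scaled

def strong_scaling_efficiency(t_base: float, n_base: int,
                              t_scaled: float, n_scaled: int) -> float:
    """Problem size is fixed, so ideally runtime shrinks as 1/N."""
    return (t_base * n_base) / (t_scaled * n_scaled)

# Hypothetical step times:
print(weak_scaling_efficiency(1.00, 1.00))                 # 1.00 -> 100% efficient
print(strong_scaling_efficiency(1.00, 1024, 0.30, 4096))   # ~0.83 -> bottlenecks bite
```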

While ORNL’s success in training large AI models on AMD hardware is commendable, there is still work to be done to enhance the performance of these workloads. Most training frameworks are designed for Nvidia hardware, and support for the ROCm platform remains limited. Nevertheless, the lessons learned from this experiment can serve as a blueprint for other facilities operating non-Nvidia, non-CUDA-based systems, offering hope for wider adoption of AMD hardware in AI training.
