Efficient Training on Supercomputers: NVIDIA vs. AMD and Intel

In a recent research paper, computer engineers at Oak Ridge National Laboratory detailed their training of a large language model (LLM) on the Frontier supercomputer. What's notable is that they achieved strong training efficiency while using only a fraction of the machine's available GPUs. This raises questions about how training performance compares across different hardware ecosystems.

The team utilized 3,072 of Frontier's 37,888 AMD Instinct MI250X GPUs to train an LLM with one trillion parameters. A model of that size requires far more memory than any single GPU provides, so its weights and optimizer state had to be spread across many MI250X GPUs. Splitting the model this way, however, introduces a parallelism problem: communication and workload balance must be tuned carefully, or the additional GPUs sit idle rather than being used efficiently.
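To see why a single accelerator cannot hold such a model, here is a rough back-of-the-envelope estimate. The per-parameter byte counts below reflect a common mixed-precision Adam training layout and are assumptions for illustration, not figures taken from the paper; the 64 GB figure is the HBM capacity of one MI250X graphics compute die.

```python
# Rough memory estimate for training a 1-trillion-parameter model.
# Assumption: a typical mixed-precision Adam layout (not the paper's exact setup).

def training_memory_tb(num_params: float) -> float:
    """Approximate memory (TB) for weights, gradients, and Adam optimizer state.

    Common mixed-precision training keeps per parameter:
      - fp16 weights:            2 bytes
      - fp16 gradients:          2 bytes
      - fp32 master weights:     4 bytes
      - fp32 Adam moments (2x):  8 bytes
    """
    bytes_per_param = 2 + 2 + 4 + 8
    return num_params * bytes_per_param / 1e12


if __name__ == "__main__":
    total_tb = training_memory_tb(1e12)   # ~16 TB of training state
    gcd_memory_tb = 64 / 1000             # one MI250X compute die: 64 GB of HBM
    print(f"Total training state: ~{total_tb:.0f} TB")
    print(f"GPUs needed just to hold it: ~{total_tb / gcd_memory_tb:.0f}")
    # Activations and communication buffers push the real requirement higher,
    # which is why the model must be sharded across many GPUs.
```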

By iterating on frameworks such as Megatron-DeepSpeed and PyTorch FSDP, the researchers tuned the training setup for Frontier. The results were impressive, with weak scaling efficiency reaching 100% and strong scaling efficiency of 87-89%. Weak scaling efficiency measures how well performance holds up when processor count and workload grow together; strong scaling efficiency measures how well a fixed workload speeds up as processors are added.
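As an illustration of what those figures mean, the sketch below computes weak- and strong-scaling efficiency from throughput and runtime numbers; the sample values are hypothetical and are not measurements from the paper.

```python
# Minimal sketch of how weak- and strong-scaling efficiency are typically computed.
# All sample numbers are illustrative assumptions, not results from the paper.

def weak_scaling_efficiency(per_gpu_throughput_base: float,
                            per_gpu_throughput_scaled: float) -> float:
    """Weak scaling: workload per GPU stays fixed as GPUs are added,
    so ideal per-GPU throughput is unchanged."""
    return per_gpu_throughput_scaled / per_gpu_throughput_base


def strong_scaling_efficiency(time_base: float, gpus_base: int,
                              time_scaled: float, gpus_scaled: int) -> float:
    """Strong scaling: total workload is fixed,
    so ideal runtime shrinks in proportion to the GPU count."""
    ideal_time = time_base * gpus_base / gpus_scaled
    return ideal_time / time_scaled


if __name__ == "__main__":
    # Hypothetical: doubling the GPU count cuts a fixed job from 100 h to 56 h.
    print(f"strong: {strong_scaling_efficiency(100, 1024, 56, 2048):.0%}")  # ~89%
    # Hypothetical: per-GPU throughput is unchanged as job and GPU count grow.
    print(f"weak:   {weak_scaling_efficiency(31.9, 31.9):.0%}")             # 100%
```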

The research paper also highlights the disparities between the hardware ecosystems of NVIDIA, AMD, and Intel. Most machine learning at this scale is carried out within NVIDIA's CUDA ecosystem, leaving AMD's and Intel's software stacks underdeveloped in comparison. The paper notes that efficient training on AMD GPUs remains underexplored and that tooling and documentation around the ROCm platform are comparatively sparse.

Currently, Frontier, built entirely on AMD CPUs and GPUs, remains the fastest supercomputer, followed by the Intel-based Aurora, which has so far submitted benchmark results using only half of the machine. NVIDIA GPUs power the third-fastest system, Eagle. To stay competitive, AMD and Intel must catch up to NVIDIA's software stack.

This research not only demonstrates that large language models can be trained efficiently on supercomputers but also emphasizes how much training performance depends on the surrounding hardware and software ecosystem. Continued work on optimized training methods should help AMD's and Intel's platforms mature as viable options for machine learning.

Source: the blog mgz.com.tw
