New Strategies for Efficient Execution of Large Language Models on Consumer Hardware

In recent years, the widespread adoption of Large Language Models (LLMs) has created a need for efficient ways to run these models on consumer hardware. One promising approach is the sparse mixture-of-experts (MoE) architecture, which activates only a few "experts" per token and can therefore generate tokens faster than a dense model of comparable quality. The trade-off is a much larger total parameter count, which makes these models difficult to execute without high-end GPUs.

To address this challenge, a recent paper proposes strategies that capitalize on the inherent properties of MoE LLMs. The authors study the problem of running large MoE language models on consumer hardware and explore two main avenues of optimization: compressing model parameters and offloading them to cheaper storage, such as system RAM or SSD. These optimizations target inference rather than training.

One of the key strategies introduced in the paper is parameter offloading: model parameters are kept in cheaper memory and loaded onto the GPU just in time for computation. Because deep learning models execute their layers in a fixed order, the next layer's parameters can be prefetched in the background while the current layer is still computing. MoE models complicate this, however, because which experts a token will use is only known once the layer's gating function has run.
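As a rough illustration, here is a minimal PyTorch sketch of layer-by-layer offloading with background prefetching, assuming a plain list of decoder layers whose weights start in pinned CPU memory; helper names such as `prefetch_` and `offload_` are hypothetical and not taken from the paper's code.

```python
import torch

# Minimal sketch: while layer i computes on the default stream, layer i+1's
# weights are copied host-to-GPU on a side stream so the transfer overlaps
# with compute. Assumes weights sit in pinned CPU memory between uses.
copy_stream = torch.cuda.Stream()

def prefetch_(layer, device="cuda"):
    """Start an asynchronous copy of a layer's parameters to the GPU."""
    with torch.cuda.stream(copy_stream):
        for p in layer.parameters():
            p.data = p.data.to(device, non_blocking=True)

def offload_(layer):
    """Move a layer's parameters back to CPU to free GPU memory."""
    for p in layer.parameters():
        p.data = p.data.to("cpu")

def forward_offloaded(layers, x):
    prefetch_(layers[0])
    for i, layer in enumerate(layers):
        torch.cuda.current_stream().wait_stream(copy_stream)  # make sure weights arrived
        if i + 1 < len(layers):
            prefetch_(layers[i + 1])   # overlap the next copy with this layer's compute
        x = layer(x)
        offload_(layer)
    return x
```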

The paper also introduces the concept of Expert Locality and LRU Caching, leveraging the observation that MoE models tend to reuse the same experts for nearby tokens: individual experts specialize in distinct sub-tasks, and a sequence often stays on the same sub-task for a while. By keeping recently active experts in GPU memory as an LRU "cache" for future tokens, the authors observe a significant speedup in inference for modern MoE models.
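The caching policy itself is simple to sketch. The snippet below keeps up to `capacity` experts per layer on the GPU in least-recently-used order; `load_expert` stands in for whatever routine copies an expert's weights from RAM or SSD, and all names here are illustrative rather than the paper's implementation.

```python
from collections import OrderedDict

class ExpertLRUCache:
    """Keep the most recently used experts of each layer resident on the GPU."""

    def __init__(self, load_expert, capacity=2):
        self.load_expert = load_expert  # (layer_idx, expert_idx) -> expert module on GPU
        self.capacity = capacity        # experts kept per layer
        self.cache = {}                 # layer_idx -> OrderedDict[expert_idx, module]

    def get(self, layer_idx, expert_idx):
        layer_cache = self.cache.setdefault(layer_idx, OrderedDict())
        if expert_idx in layer_cache:
            layer_cache.move_to_end(expert_idx)           # cache hit: mark as most recent
            return layer_cache[expert_idx]
        expert = self.load_expert(layer_idx, expert_idx)  # cache miss: load from RAM/SSD
        layer_cache[expert_idx] = expert
        if len(layer_cache) > self.capacity:
            _, evicted = layer_cache.popitem(last=False)  # evict least recently used
            for p in evicted.parameters():
                p.data = p.data.to("cpu")                 # offload the evicted expert
        return expert
```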

To further hide expert loading time, the authors propose Speculative Expert Loading: the next layer's gating function is applied to the hidden states produced by the previous layer to guess which experts will likely be needed, and those experts are prefetched while the current layer is still computing. A wrong guess only means the correct experts are loaded on demand as usual.
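A sketch of that idea, reusing the hypothetical `ExpertLRUCache` from above: applying the next layer's router to hidden states that are already available gives a cheap guess at which experts to prefetch. The function and parameter names are assumptions for illustration, not the paper's code.

```python
import torch

def speculative_prefetch(hidden_states, next_layer_gate, cache, next_layer_idx, top_k=2):
    """Guess the next layer's experts from the current hidden states and warm the cache.

    `next_layer_gate` is assumed to be the next MoE layer's routing (gating) module,
    and `cache` the ExpertLRUCache sketched earlier; both names are illustrative.
    """
    with torch.no_grad():
        router_logits = next_layer_gate(hidden_states)          # [tokens, n_experts]
        guesses = torch.topk(router_logits, top_k, dim=-1).indices
    for expert_idx in guesses.unique().tolist():
        cache.get(next_layer_idx, expert_idx)  # loads the expert only if it is not cached
```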

Another strategy explored in the paper is MoE quantization: compressed experts take less time to transfer to the GPU and occupy less cache space. The authors adopt Half-Quadratic Quantization (HQQ) because it is data-free, requiring no calibration data, and it achieves a better quality-size trade-off when experts are quantized to a lower bitwidth.
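To convey why a lower bitwidth helps, here is a plain data-free, group-wise round-to-nearest quantizer in PyTorch. This is not the HQQ algorithm itself (HQQ additionally optimizes zero-points with a half-quadratic solver), and a real implementation would bit-pack the codes, but it shows how each expert shrinks to a small integer tensor plus per-group metadata before being shipped to the GPU.

```python
import torch

def quantize_groupwise(w, nbits=3, group_size=64):
    """Data-free min/max quantization of a weight matrix, one scale/zero per group."""
    shape = w.shape
    w = w.reshape(-1, group_size)
    w_min = w.min(dim=1, keepdim=True).values
    w_max = w.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / (2 ** nbits - 1)
    q = torch.clamp(torch.round((w - w_min) / scale), 0, 2 ** nbits - 1).to(torch.uint8)
    # q plus per-group scale/zero is much smaller than the fp16 original
    # (and smaller still once the low-bit codes are packed).
    return q.reshape(shape), scale, w_min

def dequantize_groupwise(q, scale, zero, group_size=64):
    """Reconstruct approximate float weights after the (cheap) transfer to the GPU."""
    shape = q.shape
    w = q.reshape(-1, group_size).float() * scale + zero
    return w.reshape(shape)
```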

Overall, the evaluation of the proposed strategies using popular MoE models shows a significant increase in generation speed on consumer-grade hardware. These optimizations make large MoE models more accessible for research and development, opening up new possibilities for their practical application.
