Large Language Models (LLMs) have gained significant importance in recent years, driving the need to use GPUs efficiently for machine learning workloads. However, researchers face a critical challenge in accurately assessing GPU performance. The commonly used metric, GPU utilization, accessed through nvidia-smi or built-in observability tools, has proven to be an unreliable indicator of actual computational efficiency. Surprisingly, 100% GPU utilization can be achieved by simply reading from and writing to memory without performing any computation. This realization has prompted a re-evaluation of performance metrics and methodologies, leading researchers to search for more accurate ways to measure and optimize GPU performance for LLM training and inference.
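To illustrate the point, here is a minimal PyTorch sketch (an illustrative example, not taken from the original post) that keeps a GPU busy with pure memory copies; nvidia-smi will typically report utilization near 100% even though essentially no arithmetic is performed.

```python
import torch

# Two large tensors resident on the GPU (illustrative size, ~1 GiB of fp32 each).
a = torch.empty(1 << 28, device="cuda")
b = torch.empty_like(a)

# Copy memory back and forth: kernels are always in flight, so the GPU
# "utilization" figure reads near 100%, yet almost no FLOPs are executed.
for _ in range(1_000):
    b.copy_(a)
    a.copy_(b)
torch.cuda.synchronize()
```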
Researchers have attempted to address the limitations of GPU utilization by introducing alternative metrics. The most widely known is Model FLOPS Utilization (MFU), introduced in Google's PaLM paper. MFU is the ratio of observed throughput to the theoretical peak throughput of the system running at peak FLOPS, and it therefore gives a much more faithful picture of how well a workload exercises the computational capabilities of a GPU. Its drawback is that it is harder to compute, since the calculation depends on the model's parameters and the training framework. Even so, MFU has revealed significant discrepancies between GPU utilization and actual computational efficiency: some LLM training runs that reported 100% GPU utilization achieved only 20% MFU, well below the typical 35-45% range for LLM training, highlighting the need for a deeper understanding of GPU performance metrics.
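As a sketch of how such a number is obtained (a simplified estimate, not the paper's exact accounting), MFU can be approximated with the common rule of thumb of roughly 6 FLOPs per parameter per training token for dense transformers; the model size, throughput, and peak-FLOPS figures below are hypothetical.

```python
def model_flops_utilization(n_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Rough training MFU using the ~6 FLOPs per parameter per token
    approximation for dense transformers (ignores the attention term)."""
    achieved_flops = 6.0 * n_params * tokens_per_sec
    return achieved_flops / peak_flops

# Hypothetical numbers: a 7B-parameter model at 2,000 tokens/s per GPU, on a
# GPU with ~989 TFLOPS dense BF16 peak (the figure quoted for an H100 SXM).
print(f"MFU ~ {model_flops_utilization(7e9, 2_000, 989e12):.1%}")  # ~8.5%
```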
Trainy, a company specializing in GPU cluster management infrastructure, addressed the challenge of optimizing LLM training efficiency for a foundation model company. Their approach involved applying a number of commonly recommended performance tuning techniques for PyTorch: saturating the GPU by tuning data loader parameters, maximizing Tensor Core usage through mixed-precision training, employing fused optimizers from apex or DeepSpeed, and using instances and networking designed for training workloads. With these methods in place, Trainy achieved 100% GPU utilization and high power draw, which initially suggested improved performance. To gain a more complete picture of actual computational efficiency, however, the team went a step further and calculated the Model FLOPS Utilization (MFU) of the training workload, recognizing the limitations of relying solely on GPU utilization as a performance metric.
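The following is a generic sketch of the tuning knobs described above (a toy dataset and model stand in for the real workload; it is not Trainy's training code): multi-worker, pinned-memory data loading, bf16 mixed precision, and a fused optimizer.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-ins for a real dataset and model.
dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1024))
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

# Keep the GPU fed: multiple workers, pinned host memory, prefetching.
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    num_workers=8, pin_memory=True, prefetch_factor=4)

# Fused optimizer (PyTorch ships a fused AdamW; apex/DeepSpeed provide similar kernels).
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)

for x, y in loader:
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    # Mixed precision (bf16) keeps the Tensor Cores busy.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```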
GPU architecture is key to understanding the limitations of GPU utilization as a performance metric. GPUs consist of cores grouped under multiprocessing managers: Streaming Multiprocessors (SMs) on NVIDIA and Compute Units (CUs) on AMD. The GH100 GPU, for example, has 144 SMs, each managing multiple CUDA cores. NVIDIA's own definition of GPU utilization is vague, while Datadog's NVML documentation provides more clarity: it is essentially the percentage of time during which one or more kernels were executing on the GPU. This metric can therefore be misleading, because it only indicates that the GPU is doing something, not how efficiently it is computing. When a CUDA kernel is launched, its work is distributed across the SMs, which in turn schedule it onto their cores, but the utilization percentage says nothing about the intensity or effectiveness of those computations.
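For reference, the figure that nvidia-smi reports can be queried programmatically through NVML. A small sketch using the pynvml bindings (assuming the package is installed):

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# NVML's "gpu" utilization: percent of time over the sample period during which
# one or more kernels were executing -- it says nothing about how many SMs or
# cores those kernels actually kept busy.
rates = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU utilization: {rates.gpu}%  |  memory controller: {rates.memory}%")

pynvml.nvmlShutdown()
```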
To investigate the performance bottleneck further, the researchers profiled the model's training loop with the PyTorch Profiler. This analysis revealed a critical insight: the Softmax kernel registered high GPU utilization but low SM (Streaming Multiprocessor) efficiency. This raised a red flag, because a naive Softmax implementation is a well-known, memory-bound bottleneck for large language models, and kernel fusion techniques such as FlashAttention were developed precisely to address it. The low SM efficiency pointed to real inefficiencies in model execution despite the apparently full GPU utilization, reinforcing the limitations of that metric on its own. The profiling results underscored the need for a more nuanced approach to optimizing LLM training, one that tracks SM efficiency alongside GPU utilization.
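This kind of profiling can be reproduced with a short, self-contained sketch like the one below (a toy model and input, not the authors' workload); the printed table ranks kernels by GPU time, and the exported trace can be opened in a trace viewer to inspect per-kernel occupancy and SM-activity estimates.

```python
import torch
from torch import nn
from torch.profiler import profile, ProfilerActivity

# Toy stand-ins so the snippet runs on its own.
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()
batch = torch.randn(64, 1024, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True, with_stack=True) as prof:
    model(batch).mean().backward()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
prof.export_chrome_trace("train_step_trace.json")
```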
SM efficiency, also known as SM activity, is a crucial metric on NVIDIA GPUs that measures the percentage of SMs that are active over a given time interval. An NVIDIA H100 GPU, for example, contains 132 SMs, each managing 128 cores, for a total of 16,896 cores. The metric shows how well a workload spreads across the available SMs: a CUDA kernel running continuously for 10 seconds but occupying only one SM on an H100 would show 100% GPU utilization yet just 0.7% SM efficiency. This discrepancy highlights the importance of looking beyond GPU utilization. By monitoring SM efficiency layer by layer, researchers can identify bottlenecks and low-hanging optimization opportunities in LLM training, enabling more targeted performance improvements and a more accurate assessment of computational efficiency.
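The arithmetic behind that example, written out as a quick sketch using the H100 figures stated above:

```python
H100_SMS = 132                  # streaming multiprocessors on an H100
CORES_PER_SM = 128              # CUDA cores per SM
print(H100_SMS * CORES_PER_SM)  # 16,896 cores in total

active_sms = 1                  # the hypothetical single-SM kernel
print(f"SM efficiency: {active_sms / H100_SMS:.2%}")  # well under 1%, vs. 100% GPU utilization
```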
To optimize the training run, the researchers focused on fusing layers within the transformer block. This approach replaces native PyTorch layer definitions with GPU kernels implemented in CUDA or Triton that combine multiple layers into a single kernel. The fusion targets included Softmax (via FlashAttention), the MLP, and the dropout + layer norm + residual addition operations. These fused kernels, often available in libraries such as Flash Attention, offer improved performance and reduced memory usage, largely because they avoid repeated round trips to GPU memory between operations.
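The sketch below shows the kind of replacement involved, not the authors' actual code: a naive attention block that materializes the full score matrix and runs an unfused softmax, versus PyTorch's built-in scaled_dot_product_attention, which can dispatch to a FlashAttention-style fused kernel (the flash-attn library exposes analogous functions).

```python
import math
import torch
import torch.nn.functional as F

# (batch, heads, seq, head_dim) in bf16, as a small illustrative workload.
q = torch.randn(4, 16, 1024, 64, device="cuda", dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Naive attention: materializes the (seq x seq) score matrix and applies an
# unfused softmax -- the memory-bound pattern flagged by the profiler.
scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
naive_out = torch.softmax(scores, dim=-1) @ v

# Fused alternative: never materializes the score matrix.
fused_out = F.scaled_dot_product_attention(q, k, v)

print((naive_out - fused_out).abs().max())  # only small numerical differences
```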
The main implementation challenge was identifying suitable layers to replace, since torch.compile's automatic optimizations were incompatible with newer distributed strategies such as FSDP; the fused kernels therefore had to be swapped in manually.
The optimization effort yielded significant improvements: a 4x speedup in training time and an increase in Model FLOPS Utilization (MFU) from 20% to 38%. These gains came from the fused kernels and from tuning the model parallelism to properly exploit the available 3.2 Tbps InfiniBand fabric.
Based on this work, the researchers recommend tracking both SM efficiency and GPU utilization on GPU clusters to measure performance accurately. GPU utilization indicates whether the machine is idle or busy, while SM efficiency shows how effectively the GPU is actually being used. Calculating MFU is valuable but too involved to monitor continuously. NVIDIA's Data Center GPU Manager (DCGM) tracks SM activity by default. Other metrics, such as SM occupancy, provide more detailed information about the work done by each SM but are harder to interpret. For a deeper understanding, check out the PyTorch Profiler blog, the DCGM documentation, and the Nsight profiling guides.
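As a rough sketch of what such monitoring can look like (assuming DCGM and its dcgmi CLI are installed; field IDs 1002 and 1003 are the commonly documented profiling fields for SM activity and SM occupancy and should be checked against your DCGM version):

```python
import subprocess

# Sample SM activity (field 1002) and SM occupancy (field 1003) on GPU 0,
# once per second for 60 samples.
cmd = ["dcgmi", "dmon", "-i", "0", "-e", "1002,1003", "-d", "1000", "-c", "60"]
proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(proc.stdout)
```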
Take a look at the original blog post, "GPU Utilization is a Misleading Metric," on the Trainy blog. All credit for this research goes to the researchers of this project.
Asjad is a consultant intern at Marktechpost. He is pursuing a Bachelor's degree in Mechanical Engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine Learning and Deep Learning enthusiast who is always researching applications of Machine Learning in the healthcare domain.