Machine learning and deep learning models perform remarkably well on a wide range of tasks thanks to recent technological developments. However, this outstanding performance comes at a cost: these models often require large amounts of resources and computational power to achieve state-of-the-art accuracy, which makes scaling them challenging. ML researchers and systems engineers often cannot scale their models efficiently because they lack visibility into the performance limitations of their workloads, and the amount of resources requested for a job frequently differs from what is actually needed. Understanding resource usage and bottlenecks in distributed training workloads is therefore crucial to getting the most out of a model's hardware stack.
The PyTorch team tackled this problem and recently released Holistic Trace Analysis (HTA), a Python library for performance analysis and visualization. The library can be used to understand performance and identify bottlenecks in distributed training workloads by examining traces collected with the PyTorch Profiler, also known as Kineto. Kineto traces are often difficult to interpret on their own; HTA helps surface the performance information they contain. The library was first used internally at Meta to better understand performance issues in large-scale distributed training jobs on GPUs. The team then improved several of HTA's capabilities and scaled them to support cutting-edge machine learning workloads.
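HTA consumes trace files produced by the PyTorch Profiler. The snippet below is a minimal sketch of how such a trace might be collected; the model, data, and output directory are placeholders for an actual distributed training setup, and the exact profiler arguments should be checked against the PyTorch documentation.

```python
import torch
from torch.profiler import profile, schedule, ProfilerActivity, tensorboard_trace_handler

# Placeholder model and data; substitute your own (distributed) training loop.
# Requires a CUDA device so that GPU kernel activity is captured.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [torch.randn(64, 1024, device="cuda") for _ in range(10)]

# Collect a Kineto trace that HTA can later analyze.
# tensorboard_trace_handler writes one JSON trace file per rank into ./trace_dir.
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./trace_dir"),
) as prof:
    for batch in batches:
        loss = model(batch).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # advance the profiler schedule each iteration
```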
Understanding GPU performance in distributed training jobs requires considering several elements, such as how model operators interact with GPU devices and how those interactions can be measured. GPU activity during model execution can be classified into three main kernel categories: computation (COMP), communication (COMM), and memory (MEM). Computation kernels handle all the mathematical operations carried out during model execution. Communication kernels are responsible for synchronizing and transferring data between the GPU devices in a distributed training job. Memory kernels manage memory allocations on GPU devices and data transfers between host memory and the GPUs.
The performance of a GPU training job depends heavily on how the model execution creates and orchestrates these GPU kernels. This is where the HTA library comes in: it provides insight into how model execution interacts with the GPU hardware and points out where speed can be improved. The library aims to give users a deeper understanding of the inner workings of distributed GPU training.
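As a rough sketch of how the library is used, the entry point is a TraceAnalysis object pointed at a directory of per-rank trace files. The import path and constructor argument below follow the HTA documentation, and the directory path is a placeholder; treat the details as assumptions to verify against the current docs.

```python
# pip install HolisticTraceAnalysis
from hta.trace_analysis import TraceAnalysis

# trace_dir holds one Kineto trace file per rank, e.g. the output of the
# PyTorch Profiler run shown earlier (path is a placeholder).
analyzer = TraceAnalysis(trace_dir="./trace_dir")
```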
Understanding how a GPU training job behaves is difficult even for experienced practitioners. This motivated the PyTorch team to create HTA, which streamlines trace analysis and provides insights by examining model execution traces. HTA supports this through the following features (a combined usage sketch follows the list):
Temporal breakdown: This feature provides a breakdown of the time GPUs spend across all ranks in terms of computation, communication, memory events, and idle time.
Kernel breakdown: This feature separates the time spent in each of the three kernel types (COMM, COMP, and MEM) and orders the kernels by the time spent in them.
Kernel duration distribution: HTA produces bar charts visualizing the distribution of the average time spent by a particular kernel across all ranks. The charts also show the minimum and maximum time that a given kernel takes on a particular rank.
Communication computation overlap: In distributed training, many GPU devices must communicate and synchronize with each other, which takes a considerable amount of time. To achieve high GPU efficiency, it is essential to keep a GPU from blocking while it waits for data from other GPUs. Measuring the overlap between computation and communication is one way to evaluate how much computation is held up by data dependencies. This feature computes the percentage of time during which communication and computation overlap.
Augmented counters (queue length, memory bandwidth): For debugging purposes, HTA generates augmented trace files with counters that show the memory bandwidth used as well as the number of outstanding operations on each CUDA stream (also known as queue length).
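Putting the features above together, a usage sketch might look like the following. The method names mirror those in the HTA documentation at the time of writing; they should be verified against the current docs, and most of them return pandas DataFrames alongside optional visualizations.

```python
from hta.trace_analysis import TraceAnalysis

analyzer = TraceAnalysis(trace_dir="./trace_dir")  # placeholder path

# Temporal breakdown: compute / communication / memory / idle time per rank.
time_df = analyzer.get_temporal_breakdown()

# Kernel breakdown: time spent in COMP, COMM, and MEM kernels.
kernel_type_df, kernel_df = analyzer.get_gpu_kernel_breakdown()

# Communication / computation overlap percentage per rank.
overlap_df = analyzer.get_comm_comp_overlap()

# Augmented counters: queue length and memory bandwidth statistics, plus a
# new trace file with these counters attached for debugging.
queue_df = analyzer.get_queue_length_summary()
mem_bw_df = analyzer.get_memory_bw_summary()
analyzer.generate_trace_with_counters()
```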
These key features give users a view of how the system works and help them understand what is going on internally. The PyTorch team also intends to add features in the near future that will explain why certain things happen and suggest strategies to overcome bottlenecks. HTA has been made available as an open source library to serve a wider audience. It can be used for a variety of workloads, including deep learning based recommender systems, NLP models, and computer vision tasks. Detailed documentation for the library can be found here.
Check out the GitHub and Blog. All credit for this research goes to the researchers of this project. Also, don't forget to join our Reddit page, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more.
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Goa. She is passionate about machine learning, natural language processing, and web development. She likes to learn more about the technical field by participating in various challenges.