With the rapid growth of artificial intelligence, driven by the introduction of large language models (LLMs) and generative AI, demand for more efficient graphics processing units (GPUs) has surged. GPUs are specialized hardware widely used for demanding computing tasks, capable of executing many calculations in parallel. Writing proper GPU kernels is essential to using GPUs to their full potential, but the task is time-consuming and complex, requiring deep expertise in GPU architecture and in programming languages such as C++ and CUDA.
Machine learning (ML) compilers like TVM, Triton, and Mojo provide some automation, but still require manual tuning of GPU kernels to get optimal results. To achieve optimal performance without this manual effort, researchers from Carnegie Mellon University have developed Mirage, an innovative tool designed to automate the generation of high-performance GPU kernels by searching for and generating them. Kernels generated by Mirage operate directly on PyTorch tensors and can be called from PyTorch programs, and users need only a few lines of code in Mirage compared with a traditional implementation, which uses many.
Mirage can be seen as a game changer, promising higher productivity, better performance, and greater correctness in AI applications. Writing kernels by hand requires significant engineering expertise because of the complexity of GPU architecture, but Mirage simplifies the process by generating kernels automatically, making it easier and faster for engineers.
Manually written GPU kernels may also contain bugs that make it difficult to achieve the required results. Research on Mirage has shown that kernels generated by Mirage are 1.2x to 2.5x faster than the best human-written code. Additionally, integrating Mirage into PyTorch reduces overall latency by 15-20%.
```python
# Use Mirage to generate GPU kernels for attention
import mirage as mi

graph = mi.new_kernel_graph()
Q = graph.new_input(dims=(64, 1, 128), dtype=mi.float16)
K = graph.new_input(dims=(64, 128, 4096), dtype=mi.float16)
V = graph.new_input(dims=(64, 4096, 128), dtype=mi.float16)
A = graph.matmul(Q, K)   # attention scores
S = graph.softmax(A)     # attention weights
O = graph.matmul(S, V)   # attention output
optimized_graph = graph.superoptimize()
```
The Mirage code occupies only a few lines, compared with the many lines a traditional hand-written implementation requires.
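As a usage sketch (an assumption based on the project README, where the superoptimized graph is invoked on half-precision CUDA tensors matching the declared input shapes), the result can then be called directly from PyTorch:

```python
import torch

# Usage sketch, not verified against every Mirage release: the optimized
# graph is assumed callable on CUDA tensors shaped like its declared inputs.
input_tensors = [
    torch.randn(64, 1, 128, dtype=torch.float16, device='cuda:0'),
    torch.randn(64, 128, 4096, dtype=torch.float16, device='cuda:0'),
    torch.randn(64, 4096, 128, dtype=torch.float16, device='cuda:0'),
]
outputs = optimized_graph(inputs=input_tensors)  # runs the generated kernel
```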
All GPU computation is centered on kernels, functions that run in parallel across multiple streaming multiprocessors (SMs) in single-program, multiple-data (SPMD) fashion. A kernel organizes its computation into a grid of thread blocks, with each thread block executing on a single SM; each block in turn contains multiple threads that perform calculations on individual data elements.
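Mirage generates such kernels automatically, but the grid/block/thread hierarchy itself can be illustrated with a generic hand-written kernel. The sketch below uses Numba's CUDA support (not Mirage) with hypothetical names and sizes:

```python
from numba import cuda
import numpy as np

@cuda.jit
def scale(out, x, alpha):
    # Each thread handles one element: blocks tile the grid,
    # and threads tile each block (SPMD execution).
    i = cuda.grid(1)
    if i < x.size:
        out[i] = alpha * x[i]

x = np.arange(1 << 20, dtype=np.float32)
out = np.zeros_like(x)
threads_per_block = 256  # each block executes on a single SM
blocks = (x.size + threads_per_block - 1) // threads_per_block
scale[blocks, threads_per_block](out, x, np.float32(2.0))
```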
The GPU follows a particular memory hierarchy, illustrated in the sketch after this list:
- Register file: per-thread storage for the fastest data access
- Shared memory: shared by all threads of a block for efficient data exchange
- Device memory: accessible to all threads in a kernel
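As a minimal illustration of all three levels in one kernel (again a Numba sketch, not Mirage; names and sizes are hypothetical), a per-block sum moves data from registers into shared memory, reduces there, and writes the result to device memory:

```python
from numba import cuda, float32
import numpy as np

TPB = 128  # threads per block (compile-time constant)

@cuda.jit
def block_sum(out, x):
    # Shared memory: one buffer per thread block, visible to all its threads.
    buf = cuda.shared.array(TPB, float32)
    tid = cuda.threadIdx.x
    i = cuda.grid(1)
    buf[tid] = x[i] if i < x.size else 0.0   # registers -> shared memory
    cuda.syncthreads()
    stride = TPB // 2
    while stride > 0:                        # tree reduction within the block
        if tid < stride:
            buf[tid] += buf[tid + stride]
        cuda.syncthreads()
        stride //= 2
    if tid == 0:
        out[cuda.blockIdx.x] = buf[0]        # partial sum -> device memory

x = np.ones(1 << 20, dtype=np.float32)
out = np.zeros((x.size + TPB - 1) // TPB, dtype=np.float32)
block_sum[out.size, TPB](out, x)
print(out.sum())  # approximately x.sum()
```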
Mirage represents this architecture through its uGraph representation, which contains graphs at multiple levels: a kernel level that encapsulates computation across the entire GPU, a thread block level that describes computation on a single streaming multiprocessor (SM), and a thread level that captures computation on CUDA or Tensor cores. uGraph provides a structured way to represent GPU computations.
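To make the three levels concrete, here is a purely illustrative sketch of a uGraph-style hierarchy; these classes are hypothetical and are not Mirage's actual internal API:

```python
from dataclasses import dataclass, field

@dataclass
class ThreadGraph:        # thread level: CUDA/Tensor-core computation
    ops: list = field(default_factory=list)

@dataclass
class ThreadBlockGraph:   # thread block level: computation on one SM
    threads: list[ThreadGraph] = field(default_factory=list)

@dataclass
class KernelGraph:        # kernel level: computation across the whole GPU
    blocks: list[ThreadBlockGraph] = field(default_factory=list)
```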
Mirage demonstrates four categories of GPU optimization:
1. Normalization + Linear
LLMs typically use normalization techniques such as LayerNorm, RMSNorm, GroupNorm, and BatchNorm, which ML compilers often treat as separate kernels because normalization requires both reduction and broadcast operations. Mirage merges these normalization layers with the following linear layers into a single matrix-multiplication kernel.
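The algebra behind this fusion can be sketched in plain PyTorch (illustrative only; Mirage discovers the fused kernel automatically). Because the RMSNorm scale is a per-row scalar, it commutes with the matrix multiplication:

```python
import torch

x = torch.randn(8, 512)
W = torch.randn(512, 512)
eps = 1e-6

# Separate kernels: reduction + broadcast (normalize), then matmul.
rms = torch.sqrt(x.pow(2).mean(-1, keepdim=True) + eps)
ref = (x / rms) @ W

# Fused view: the per-row scale can be folded around the matmul,
# so normalization and the linear layer become one kernel's worth of work.
fused = (x @ W) / rms
assert torch.allclose(ref, fused, atol=1e-4)
```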
2. LoRA + Linear
This merges low-rank adaptation (LoRA), a technique for adapting pre-trained models to new tasks or datasets while reducing computational requirements, with the linear layers it augments. The resulting kernel is 1.6x faster than existing systems.
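A PyTorch sketch of why this fusion is possible (illustrative shapes and names; this shows the algebraic restructuring, not Mirage's actual kernel):

```python
import torch

d, r = 512, 16
x = torch.randn(8, d)
W = torch.randn(d, d)   # frozen pre-trained weight
A = torch.randn(d, r)   # LoRA down-projection
B = torch.randn(r, d)   # LoRA up-projection

# Unfused: three matmuls (base layer plus the low-rank update).
ref = x @ W + (x @ A) @ B

# Fused view: concatenating W and A lets the base layer and the
# down-projection share a single matmul, structure a fused kernel exploits.
h = x @ torch.cat([W, A], dim=1)   # one matmul, output (8, d + r)
fused = h[:, :d] + h[:, d:] @ B
assert torch.allclose(ref, fused, atol=1e-3)
```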
3. Gated MLP
This combines two MatMuls, a SiLU activation, and an element-wise multiplication into one kernel. The gated MLP kernel reduces kernel launch overhead and device memory access, running 1.3x faster than the best baseline.
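For reference, the unfused computation looks like this in PyTorch (illustrative shapes and names); each line is a separate kernel launch with intermediates round-tripping through device memory, which is exactly what the fused kernel avoids:

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 512)
W_gate = torch.randn(512, 2048)  # gate-branch weight (hypothetical)
W_up = torch.randn(512, 2048)    # up-projection weight (hypothetical)

gate = F.silu(x @ W_gate)        # matmul 1 + SiLU activation
up = x @ W_up                    # matmul 2
out = gate * up                  # element-wise multiplication
```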
4. Attention variants
a. Query-key normalization
Chameleon, ViT-22B, and a recent Google paper introduced query-key normalization, merging LayerNorm into the attention kernel. Mirage's custom kernel also applies attention optimizations missing from existing GPU kernels, yielding a 1.7x to 2.5x performance improvement.
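In unfused form, query-key normalization adds two LayerNorm kernels in front of attention; a PyTorch sketch with illustrative shapes (Mirage instead folds the normalization into the attention kernel itself):

```python
import torch
import torch.nn.functional as F

Q = torch.randn(64, 1, 128)
K = torch.randn(64, 4096, 128)
V = torch.randn(64, 4096, 128)

# Two extra kernel launches before attention in the unfused version.
Qn = F.layer_norm(Q, (128,))
Kn = F.layer_norm(K, (128,))
out = F.scaled_dot_product_attention(Qn, Kn, V)
```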
b. Multi-head latent attention
This optimizes memory usage by compressing the traditional key-value attention cache into a more compact latent vector, a change that introduces two extra linear layers into attention. Mirage generates a custom kernel that fuses these linear layers with the attention mechanism into a single kernel, avoiding the storage of intermediate key-value vectors in GPU device memory.
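A sketch of the data flow with illustrative, hypothetical shapes (not any specific model's configuration): the cache holds only the small latent C, and two linear layers re-expand it to keys and values, which a fused kernel can do without materializing K and V in device memory:

```python
import torch
import torch.nn.functional as F

b, n, d, d_latent = 8, 4096, 128, 32
C = torch.randn(b, n, d_latent)   # compact latent KV cache
W_k = torch.randn(d_latent, d)    # up-projection to keys
W_v = torch.randn(d_latent, d)    # up-projection to values

K = C @ W_k                       # reconstructed keys   (b, n, d)
V = C @ W_v                       # reconstructed values (b, n, d)
Q = torch.randn(b, 1, d)
out = F.scaled_dot_product_attention(Q, K, V)
```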
In conclusion, Mirage addresses the critical challenge of producing high-performance GPU kernels for advanced AI workloads. It eliminates the significant time investment, the need for deep coding expertise, and the bugs of hand-written kernels by generating optimal GPU kernels that run directly in PyTorch-based environments. It also captures optimizations that manual tuning might miss, accelerating the deployment of LLMs and other AI technologies in real-world applications.
Check out the GitHub page for more details. All credit for this research goes to the researchers of this project.
Nazmi Syed is a Consulting Intern at MarktechPost and is pursuing a Bachelor of Science degree at the Indian Institute of Technology (IIT) Kharagpur. He has a deep passion for data science and is actively exploring the broad applications of artificial intelligence in various industries. Fascinated by technological advancements, Nazmi is committed to understanding and implementing cutting-edge innovations in real-world contexts.