The Generative Pretrained Transformer (GPT) family of Large Language Models (LLMs) has shown remarkable performance on many tasks. However, these models are expensive to deploy because of their enormous compute and memory requirements, so it is not surprising that model compression has received so much attention as a way to reduce those costs. Quantization, i.e., reducing the precision of the numerical representation of individual weights, has been the main focus of almost all prior GPT compression work.
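To make the idea concrete, here is a minimal sketch of round-to-nearest int8 weight quantization in numpy. It is only an illustration of the general principle, not the specific scheme used by any particular GPT quantization method; the function names are our own.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q (illustrative only)."""
    scale = np.abs(w).max() / 127.0                      # map the largest weight to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)       # a dense weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("mean abs quantization error:", np.abs(w - w_hat).mean())
```

Storing `q` instead of `w` cuts memory per weight from 32 bits to 8, at the cost of the small rounding error printed above.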
Quantization can be complemented by pruning, which removes less important parts of the network, ranging from individual weights (unstructured pruning) to larger components such as entire rows and columns of weight matrices (structured pruning). However, existing pruning methods typically require extensive retraining to recover the accuracy lost when parts of the model are removed, which is prohibitively expensive at GPT scale. As a result, accurate pruning of GPT-3-scale models has seen almost no progress so far.
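The snippet below contrasts the two flavors of pruning on a single weight matrix. It is a simplified sketch with hypothetical helper names; real pruning pipelines score weights more carefully and, as noted above, usually rely on retraining afterwards.

```python
import numpy as np

def unstructured_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude individual weights (magnitude pruning)."""
    k = int(sparsity * w.size)
    threshold = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) < threshold, 0.0, w)

def structured_prune_rows(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out entire rows with the smallest L2 norm (structured pruning)."""
    norms = np.linalg.norm(w, axis=1)
    drop = np.argsort(norms)[: int(sparsity * w.shape[0])]
    w = w.copy()
    w[drop, :] = 0.0
    return w

w = np.random.randn(512, 512)
print((unstructured_prune(w, 0.5) == 0).mean())    # ~50% of entries zeroed
print((structured_prune_rows(w, 0.5) == 0).mean()) # same sparsity, whole rows removed
```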
A new study from IST Austria and Neural Magic introduces SparseGPT, the first accurate one-shot pruning method that works efficiently at the scale of models with tens to hundreds of billions of parameters. SparseGPT treats pruning as a large-scale instance of sparse regression. The algorithm is built around a new approximate sparse regression solver applied to a layer-wise compression problem, and it is efficient enough to prune the largest openly available GPT models (175B parameters) in a matter of hours on a single GPU. At the same time, SparseGPT is accurate enough to lose only a negligible amount of accuracy after pruning.
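The layer-wise view can be stated compactly: for a linear layer with dense weights W and calibration inputs X, the goal is a sparse matrix Ŵ that keeps the layer's outputs close to the dense ones, i.e. minimize ||WX − ŴX||² subject to a sparsity constraint. The sketch below only evaluates this objective for a naive magnitude-pruned candidate; SparseGPT's actual solver, which we do not reproduce here, chooses the mask and the surviving weights far more cleverly.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_samples = 256, 256, 128
W = rng.standard_normal((d_out, d_in))            # dense layer weights
X = rng.standard_normal((d_in, n_samples))        # calibration activations

# Naive candidate: magnitude-prune half of the weights with no further update.
threshold = np.median(np.abs(W))
W_hat = np.where(np.abs(W) < threshold, 0.0, W)

# The layer-wise objective that the sparse regression solver tries to minimize.
reconstruction_error = np.linalg.norm(W @ X - W_hat @ X) ** 2
print("layer-wise reconstruction error:", reconstruction_error)
```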
When tested on the largest publicly available generative language models (OPT-175B and BLOOM-176B), the researchers found that a single run of SparseGPT induces 50-60% sparsity with negligible loss of accuracy, measured in terms of perplexity or zero-shot accuracy.
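For readers unfamiliar with the metric, perplexity is simply the exponential of the average per-token negative log-likelihood, so a pruned model with "negligible loss" keeps this number close to the dense model's. A toy computation, with made-up numbers:

```python
import math

def perplexity(token_nlls):
    """token_nlls: per-token negative log-likelihoods (natural log). Lower perplexity is better."""
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([2.1, 1.8, 2.4, 2.0]))   # toy values, roughly e^2.07
```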
Two crucial points emerge from their experimental results:
- Using SparseGPT, the 175-billion-parameter variant of the OPT family can be pruned to up to 60% sparsity with only a modest impact on accuracy. By contrast, Magnitude Pruning, the only simple baseline known to work at this scale, already collapses at 30% sparsity.
- Beyond less restrictive unstructured sparsity, SparseGPT can reliably impose the more demanding 2:4 and 4:8 semi-structured sparsity patterns, which are directly supported by hardware (a small sketch of the 2:4 pattern follows this list). Although these patterns typically incur some additional accuracy loss relative to the dense baseline, especially for smaller models, they can be exploited directly for computational speedups. The sparsity introduced by the proposed method also composes with the compression achieved by quantization.
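The 2:4 pattern mentioned above means that in every group of four consecutive weights, only two may be nonzero, which is the layout accelerated by sparse tensor cores on recent GPUs. The sketch below imposes that pattern with a simple magnitude rule; SparseGPT selects which weights to keep far more carefully, so this is only an illustration of the constraint itself.

```python
import numpy as np

def prune_2_of_4(w: np.ndarray) -> np.ndarray:
    """Keep the 2 largest-magnitude weights in each group of 4 along the last axis."""
    groups = w.reshape(-1, 4)
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # two smallest entries per group
    out = groups.copy()
    np.put_along_axis(out, drop, 0.0, axis=1)
    return out.reshape(w.shape)

w = np.random.randn(8, 16)                             # last dim must be a multiple of 4
w_24 = prune_2_of_4(w)
print((w_24.reshape(-1, 4) != 0).sum(axis=1))          # exactly 2 nonzeros per group
```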
The proposed approach is intriguing because it is entirely local: it uses no global gradient information and instead computes weight updates that aim to preserve each layer's input-output behavior. Remarkably, such sparse models can be found directly in the "neighborhood" of the pretrained dense model, and their outputs correspond closely to those of the dense model.
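As a hedged illustration of what a "local" update means, the sketch below fixes a sparsity mask for one linear layer and then re-solves the surviving weights by least squares so the sparse layer reproduces the dense layer's outputs on calibration inputs. SparseGPT instead uses a much faster approximate solver; this brute-force row-by-row version only shows the objective such local updates target.

```python
import numpy as np

def local_reconstruct(W: np.ndarray, X: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Solve min ||W @ X - W_hat @ X||_F^2 with W_hat supported only on `mask`."""
    target = W @ X                                   # dense layer outputs to match
    W_hat = np.zeros_like(W)
    for i in range(W.shape[0]):                      # each output neuron separately
        keep = np.where(mask[i])[0]                  # surviving input connections
        sol, *_ = np.linalg.lstsq(X[keep].T, target[i], rcond=None)
        W_hat[i, keep] = sol
    return W_hat

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
X = rng.standard_normal((64, 256))                   # calibration activations
mask = np.abs(W) >= np.median(np.abs(W))             # a simple 50% magnitude mask
W_hat = local_reconstruct(W, X, mask)
print(np.linalg.norm(W @ X - W_hat @ X))             # far smaller than with no update
```

No gradients of the end-to-end loss are involved; each layer is repaired purely from its own inputs and outputs, which is what makes the method cheap enough to run on 175B-parameter models.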
They also found that the relative accuracy gap between the dense and sparse model variants narrows as model size increases, to the point where inducing 50% sparsity results in virtually no drop in accuracy for the largest models. This is consistent with the observation that larger models are easier to sparsify. The group hopes their findings will inspire further work on compressing such massive models.
Check out the Paper. All credit for this research goes to the researchers of this project.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a strong interest in the applications of artificial intelligence across various fields. She is passionate about exploring new advances in technology and their real-life applications.