As we all know, the race to develop mind-blowing generative models like ChatGPT and Bard, and the technology underlying them, such as GPT-3 and GPT-4, has taken the AI world by storm. Still, there are many challenges when it comes to the accessibility, training, and real-world feasibility of these models for the day-to-day problems that concern us.
If you have ever played with any of these sequence models, one problem has surely dampened your enthusiasm: the limit on the length of the input you can send to the model in a single request.
If you are an enthusiast who wants to get to the core of such technologies and train your own custom model, the optimization process involved makes it a near-impossible task.
At the heart of these problems is the quadratic cost of the attention mechanism these sequence models rely on. One of the main reasons is the computational cost of such algorithms and the resources required to run them. It is an extremely expensive proposition, especially at scale, leaving only a few well-resourced organizations with any real understanding of, and control over, such algorithms.
Simply put, attention exhibits a cost that grows quadratically with the length of the sequence. This limits the amount of accessible context, and scaling it up is an expensive business.
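To make that concrete, here is a minimal sketch (plain NumPy, not from the paper) showing why: the attention score matrix alone has one entry per pair of positions, so doubling the sequence length quadruples memory and compute.

```python
import numpy as np

def attention_score_entries(seq_len: int, d_model: int = 64) -> int:
    """Count the entries of the attention score matrix for a given sequence length."""
    q = np.random.randn(seq_len, d_model)
    k = np.random.randn(seq_len, d_model)
    scores = q @ k.T   # shape (seq_len, seq_len): O(seq_len^2) memory and compute
    return scores.size

for L in (1_024, 2_048, 4_096):
    print(L, attention_score_entries(L))   # 1,048,576 -> 4,194,304 -> 16,777,216
```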
Don’t worry, though; there is a new architecture called Hyena that is now making waves in the NLP community, with people hailing it as the savior we all need. It challenges the dominance of existing attention mechanisms, and the research demonstrates its potential to overthrow the incumbent.
Developed by a team of researchers from a leading university, Hyena boasts impressive performance on a variety of NLP tasks while remaining subquadratic in cost. In this article, we will take a closer look at Hyena’s claims.
The paper suggests that subquadratic operators can match the quality of attention models at scale without being as expensive in parameters and optimization cost. Based on targeted reasoning tasks, the authors distill the three properties of attention that contribute most to its performance:
- Data control
- Sublinear parameter scaling
- Unrestricted context.
With these points in mind, they introduce the Hyena hierarchy. This new operator combines long convolutions and element-wise multiplicative gating to match the quality of attention at scale while reducing the computational cost.
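To give a feel for what that means in code, here is a heavily simplified, single-order sketch (assuming PyTorch; the names `long_conv` and `hyena_like_block` are illustrative, not the authors’ implementation) of the two ingredients: a long convolution computed with the FFT in O(L log L), followed by an element-wise multiplicative gate.

```python
import torch

def long_conv(x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
    """Causal convolution with a filter as long as the sequence,
    computed via the FFT in O(L log L) instead of O(L^2).
    x: (batch, L, d), h: (L, d)."""
    L = x.shape[1]
    X = torch.fft.rfft(x, n=2 * L, dim=1)
    H = torch.fft.rfft(h, n=2 * L, dim=0)
    return torch.fft.irfft(X * H.unsqueeze(0), n=2 * L, dim=1)[:, :L]

def hyena_like_block(x, w_gate, w_val, h):
    """One simplified order of a Hyena-style recurrence:
    project the input, long-convolve the values, gate element-wise."""
    gate = x @ w_gate                # data-controlled gate, (batch, L, d)
    val = x @ w_val                  # value projection,     (batch, L, d)
    return gate * long_conv(val, h)  # element-wise multiplicative gating

# Toy usage
B, L, d = 2, 1024, 32
x = torch.randn(B, L, d)
h = torch.randn(L, d) / L            # implicit long filter (learned in practice)
w_gate, w_val = torch.randn(d, d), torch.randn(d, d)
print(hyena_like_block(x, w_gate, w_val, h).shape)   # torch.Size([2, 1024, 32])
```

Note that the filter is as long as the sequence itself, which is what gives the operator its unrestricted context, while the FFT keeps the cost subquadratic.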
The experiments carried out reveal amazing results.
- Language Modeling
Hyena’s scaling was tested on autoregressive language modeling. Evaluated on perplexity over the WikiText-103 and The Pile benchmark datasets, Hyena is the first attention-free, convolution-based architecture to match GPT quality with a 20% reduction in total training FLOPs.
Perplexity on WikiText-103 (same tokenizer). ∗ marks results from Dao et al. (2022c). The deeper and thinner models (Hyena-slim) achieve lower perplexity.
Perplexity on The Pile for models trained to a fixed total number of tokens, e.g. 5 billion (a different run for each token total). All models use the same tokenizer (GPT-2). The FLOP count is for the 15-billion-token run.
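For readers unfamiliar with the metric reported above, perplexity is simply the exponential of the average per-token negative log-likelihood; a tiny illustrative helper (not tied to the paper’s code):

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean per-token negative log-likelihood, in nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

print(perplexity([2.3, 2.1, 2.5]))   # ~9.97 — lower is better
```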
- Large-Scale Image Classification
The paper also demonstrates Hyena’s potential as a general deep learning operator by applying it to image classification. The authors replace the attention layers in the Vision Transformer (ViT) with the Hyena operator and match ViT’s performance.
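As a rough illustration of what that swap looks like (a hypothetical sketch, not the authors’ code; `SimpleLongConvMixer` is a stand-in reusing the FFT-convolution-plus-gating idea above, where the paper’s actual Hyena operator would be dropped in), a ViT encoder block might look like this:

```python
import torch
import torch.nn as nn

class SimpleLongConvMixer(nn.Module):
    """Simplified stand-in for the Hyena operator (FFT long convolution + gating)."""
    def __init__(self, dim: int, max_len: int):
        super().__init__()
        self.filter = nn.Parameter(torch.randn(max_len, dim) / max_len)
        self.gate_proj = nn.Linear(dim, dim)
        self.val_proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (batch, tokens, dim)
        L = x.shape[1]
        v = self.val_proj(x)
        V = torch.fft.rfft(v, n=2 * L, dim=1)
        H = torch.fft.rfft(self.filter[:L], n=2 * L, dim=0)
        conv = torch.fft.irfft(V * H.unsqueeze(0), n=2 * L, dim=1)[:, :L]
        return self.gate_proj(x) * conv         # element-wise gating

class ViTBlockWithHyena(nn.Module):
    """ViT encoder block with self-attention swapped for a Hyena-style mixer."""
    def __init__(self, dim: int, max_len: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = SimpleLongConvMixer(dim, max_len)   # was nn.MultiheadAttention
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, int(dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(dim * mlp_ratio), dim),
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))        # token mixing (previously attention)
        x = x + self.mlp(self.norm2(x))          # channel mixing, unchanged
        return x

block = ViTBlockWithHyena(dim=192, max_len=197)    # 196 patches + class token
print(block(torch.randn(2, 197, 192)).shape)       # torch.Size([2, 197, 192])
```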
On 2D CIFAR, the authors test a 2D version of the Hyena long convolution filters in a standard convolutional architecture, which improves on the accuracy of the 2D long-convolutional model S4ND (Nguyen et al., 2022) with an 8% speedup and 25% fewer parameters.
Promising results at the sub-billion-parameter scale suggest that attention may not be all we need, and that simpler subquadratic designs like Hyena, informed by simple guiding principles and evaluation on mechanistic interpretability benchmarks, can form the basis of efficient large models.
With the waves this architecture is creating in the community, it will be interesting to see whether Hyena gets the last laugh.
Check out the Paper and the GitHub link for more details.
The author is a data scientist currently working at S&P Global Market Intelligence. He has previously worked as a data scientist at AI product startups, and is a reader and learner at heart.