Model specialization involves tailoring a pretrained machine learning model to a specific task or domain. For language models (LMs), specialization is crucial for improving performance on tasks such as summarization, question answering, translation, and language generation. The two main approaches for specializing a language model are instruction fine-tuning (adapting a pretrained model to a new task or set of tasks) and model distillation (transferring knowledge from a pretrained "teacher" model to a smaller, specialized "student" model). Prompting is a key concept in LM specialization, as it provides a way to steer the model toward specific behaviors, enables more efficient use of limited training data, and is crucial for achieving state-of-the-art performance. Prompt compression is a technique being studied with the hope of delivering substantial savings in compute, memory, and storage with no substantial loss in overall performance or output quality.
This paper, presented by researchers at Stanford University, proposes a novel technique for prompt compression called gisting, which trains an LM to compress prompts into smaller sets of "gist" tokens. To reduce the cost of prompting, techniques such as fine-tuning or distillation can be used to train a model that behaves like the original without the prompt, but the model then has to be retrained for each new prompt, which is far from ideal. The idea behind gisting is instead to use a meta-learning approach to predict gist tokens from a prompt, which does not require retraining the model for each task and allows generalization to unseen instructions without additional training. This reduces computational cost and allows a prompt to be compressed, cached, and reused for compute efficiency. It also lets users fit more content into the limited context window.
The authors experimented with a simple way to obtain such a model: they used the LM itself (leveraging its pre-existing knowledge) to predict gist tokens during instruction fine-tuning while modifying the Transformer's attention masks. Given a (task, input) pair, they insert gist tokens between the task and the input and set the attention mask as follows: input tokens that come after the gist tokens cannot attend to any of the prompt tokens before the gist tokens (but they can attend to the gist tokens themselves). Since the input and output cannot attend to the prompt, this forces the model to compress the prompt's information into the intermediate gist tokens.
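The masking scheme is easy to picture with a small sketch. The snippet below is a minimal illustration (not the authors' code), assuming a sequence laid out as prompt tokens, then gist tokens, then input/output tokens; a True entry means the row's token may attend to the column's token.

```python
import torch

def make_gist_mask(prompt_len: int, num_gist: int, rest_len: int) -> torch.Tensor:
    """Illustrative causal attention mask for gisting.

    Sequence layout (left to right): prompt tokens, gist tokens, then the
    input/output tokens. Tokens after the gist may attend to the gist tokens
    (and, causally, to later tokens) but NOT to the original prompt, which
    forces the prompt's information to be compressed into the gist tokens.
    """
    total = prompt_len + num_gist + rest_len
    # Standard causal (lower-triangular) mask: True = may attend.
    mask = torch.tril(torch.ones(total, total, dtype=torch.bool))
    gist_end = prompt_len + num_gist
    # Block attention from post-gist tokens back to the raw prompt tokens.
    mask[gist_end:, :prompt_len] = False
    return mask

# Example: a 5-token prompt, 1 gist token, 4 input/output tokens.
print(make_gist_mask(prompt_len=5, num_gist=1, rest_len=4).int())
```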
To train gist models, they needed a dataset covering a large variety of tasks, so they created a dataset called Alpaca+, which combined data from two existing instruction-tuning datasets (Stanford Alpaca and Self-Instruct), totaling over 130k examples. They then held out three validation splits to evaluate the model after training: seen prompts, unseen prompts, and hand-written human prompts. In this way, they could test generalization to unseen instructions, with the human split posing an even stronger generalization challenge. They also used two LM architectures (LLaMA-7B, a decoder-only GPT-style model, and FLAN-T5-XXL, an encoder-decoder model) and trained gist models with a varying number of gist tokens (1, 2, 5, or 10). The results showed that the models were generally insensitive to the number of gist tokens and, in some cases, that a higher number of tokens was actually detrimental to performance. They therefore used models with a single gist token for the rest of the experiments.
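As a rough illustration of how such training data might be prepared, the sketch below adds a special gist token to a Hugging Face tokenizer and inserts it between the instruction and the input. The "<GIST>" string, the prompt template, and the checkpoint name are assumptions made for the example, not the authors' exact setup.

```python
# A rough sketch (not the authors' code) of inserting gist tokens between
# the instruction and the input before fine-tuning.
from transformers import AutoTokenizer

NUM_GIST_TOKENS = 1  # the paper found 1 token works about as well as 2, 5, or 10

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # illustrative checkpoint
tokenizer.add_special_tokens({"additional_special_tokens": ["<GIST>"]})
# The model's embedding matrix must be resized to match, e.g.:
# model.resize_token_embeddings(len(tokenizer))

def format_example(instruction: str, model_input: str) -> str:
    # Gist tokens sit between the instruction and the input, as described above.
    gist_span = "<GIST>" * NUM_GIST_TOKENS
    return f"Instruction: {instruction}\n{gist_span}\nInput: {model_input}\nOutput:"

print(format_example("Translate the sentence to French.", "I like apples."))
```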
To assess the quality of prompt compression, they calibrated performance against a positive control, which was effectively standard instruction fine-tuning and provided an upper bound on performance, and a negative control in which the model had no access to the instruction at all (equivalent to random gist tokens), which provided a lower bound on performance. To compare their models' outputs with the positive control and measure the win rate, they asked ChatGPT to choose which answer was better and to explain its reasoning. They also used a simple lexical overlap statistic, ROUGE-L, a metric that measures the similarity between generated text and reference outputs in open-ended instruction fine-tuning. A 50% win rate indicates that the model is of comparable quality to one that does not perform prompt compression at all.
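For reference, a ROUGE-L score like the one mentioned above can be computed with an off-the-shelf package. The snippet below uses the rouge-score library and made-up strings purely to illustrate the metric; it is not the paper's evaluation pipeline.

```python
# Computing ROUGE-L between a model output and a reference answer, using the
# rouge-score package (pip install rouge-score). Strings are illustrative.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
reference = "The capital of France is Paris."
prediction = "Paris is the capital of France."
scores = scorer.score(reference, prediction)
print(scores["rougeL"].fmeasure)  # F1 of the longest-common-subsequence overlap
```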
The results showed that, on seen instructions, the gist models performed very closely to the positive control, with win rates of 48.6% (LLaMA) and 50.8% (FLAN-T5). More importantly, the gist models generalized competitively to unseen prompts, with win rates of 49.7% (LLaMA) and 46.2% (FLAN-T5). Only on the most challenging human split did they see slight (but still competitive) drops in win rate, to 45.8% (LLaMA) and 42.5% (FLAN-T5). The slightly worse performance of FLAN-T5 and the specific failure cases raise hypotheses to be tested in future work.
The researchers also investigated the potential efficiency gains achievable through gisting, which was a primary motivation for the study. The results were encouraging, with gist caching leading to a 40% reduction in FLOPs and 4-7% lower wall-clock time compared to unoptimized models. While these improvements are modest for decoder-only language models, the researchers also demonstrated that gist models allow up to 26x compression of unseen prompts, freeing considerable additional space in the input context window.
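The mechanism behind these savings is essentially prefix caching: the activations of the compressed prompt are computed once and reused across inputs, and gisting shrinks that cached prefix from the full prompt down to a handful of gist tokens. The sketch below illustrates plain prefix caching with the Hugging Face transformers API and a small GPT-2 checkpoint; the model choice is ours, and a real gist model would cache only the gist tokens' activations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustration of prefix caching (not the authors' code): encode a prompt once,
# keep its key/value activations, and reuse them for later inputs. GPT-2 is
# used only because it is small to download.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tokenizer("Summarize the following text:", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(prompt_ids, use_cache=True)
prompt_cache = out.past_key_values  # computed once, reusable across many inputs

new_ids = tokenizer(" The cat sat on the mat all day.", return_tensors="pt").input_ids
with torch.no_grad():
    out2 = model(new_ids, past_key_values=prompt_cache, use_cache=True)
# out2.logits now reflect the cached prompt without re-encoding it.
```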
Overall, these findings illustrate the significant potential of gisting to improve both the effectiveness and efficiency of specialized language models. The authors also suggest several promising directions for follow-up work on gisting. For example, they posit that the greatest computational and efficiency gains from gisting will come from compressing longer prompts, and that "gist pre-training" could improve compression performance by first learning to compress arbitrary spans of natural language before learning to compress instructions.
Check out the Paper and GitHub. Don't forget to join our 19k+ ML SubReddit, Discord channel, and email newsletter, where we share the latest AI research news, exciting AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at [email protected]
🚀 Check out 100 AI tools at AI Tools Club
Nathalie Crevoisier holds a BS and MS in Physics from Imperial College London. She spent a year studying applied data science, machine learning, and internet analytics at the Ecole Polytechnique Federale de Lausanne (EPFL) as part of her degree. During her studies, she developed a keen interest in AI, which led her to join Meta (formerly Facebook) as a data scientist after graduation. During her four-year tenure at the company, Nathalie worked across multiple teams, including Ads, Integrity, and Workplace, applying cutting-edge data science and ML tools to solve complex problems affecting billions of users. Seeking more independence and time to keep up with the latest AI discoveries, she recently decided to transition into a freelance career.