Language models are now capable of performing many new natural language processing (NLP) tasks by reading instructions, often ones they have not seen before. The ability to reason about new tasks is mostly credited to training models on a wide variety of tasks phrased as instructions, known as “instruction tuning,” which was introduced by FLAN and extended in T0, Super-Natural Instructions, MetaICL, and InstructGPT. However, much of the data that drives these advances remains unreleased to the broader research community.
In “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning”, we closely examine and release a newer and more extensive publicly available collection of tasks, templates, and methods for instruction tuning to advance the community’s ability to analyze and improve instruction tuning methods. This collection was first used in Flan-T5 and Flan-PaLM, the latter of which achieved significant improvements over PaLM. We show that training a model on this collection yields improved performance over comparable public collections across all tested evaluation benchmarks, e.g., a 3%+ improvement on the 57 tasks in the Massive Multitask Language Understanding (MMLU) evaluation suite and an 8% improvement on BIG-Bench Hard (BBH). Analysis suggests the improvements stem both from the larger and more diverse set of tasks and from applying a set of simple training and data augmentation techniques that are cheap and easy to implement: mixing zero-shot, few-shot, and chain-of-thought training prompts, enriching tasks with input inversion, and balancing task mixtures. Together, these methods enable the resulting language models to reason more competently over arbitrary tasks, even those for which they have seen no fine-tuning examples. We hope that making these findings and resources publicly available will accelerate research into more powerful and general-purpose language models.
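As a rough illustration of one of the simple techniques mentioned above, the sketch below shows examples-proportional task mixture balancing with a per-task example cap. It is a minimal sketch rather than the actual Flan codebase: the task names, sizes, and cap value are all illustrative assumptions.

```python
# Minimal sketch of task-mixture balancing via examples-proportional mixing
# with a per-task cap. Task names, sizes, and the cap are illustrative only,
# not the actual Flan 2022 configuration.

def mixing_weights(task_sizes: dict[str, int], cap: int = 3000) -> dict[str, float]:
    """Weight each task by its (capped) number of examples.

    Capping keeps very large tasks from dominating the mixture while still
    sampling them more often than tiny tasks.
    """
    capped = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(capped.values())
    return {name: size / total for name, size in capped.items()}


if __name__ == "__main__":
    sizes = {"anli_r1": 17_000, "bool_q": 9_400, "copa": 400}  # illustrative sizes
    for task, weight in mixing_weights(sizes).items():
        print(f"{task}: sampled with probability {weight:.3f}")
```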
Public instruction tuning data collections
Since 2020, several instruction tuning task collections have been released in rapid succession, shown in the timeline below. Recent research has yet to coalesce around a unified set of techniques, with different sets of tasks, model sizes, and input formats all represented. This new collection, referred to below as “Flan 2022,” combines prior collections from FLAN, P3/T0, and Natural Instructions with new dialog, program synthesis, and complex reasoning tasks.
A timeline of public instruction tuning collections, including: UnifiedQA, CrossFit, Natural Instructions, Flan 2021, P3, MetaICL, ExT5, Super-Natural Instructions, mT0, Unnatural Instructions, Self-Instruct, and OPT-IML Bench. The table describes the release date, the task collection name, the model name, the base model(s) fine-tuned with the collection, the model size, whether the resulting model is public (green) or not public (red), whether the models are trained with zero-shot (“ZS”), few-shot (“FS”), or chain-of-thought (“CoT”) prompts together (“+”) or separately (“/”), the number of tasks from the collection included in Flan 2022, the total number of examples, and some notable collection-related methods used in these works. Note that the number of tasks and examples are approximations and vary under different assumptions; the counts for each are reported using the task definitions from the respective works.
In addition to scaling to more instructive training tasks, The Flan Collection combines training with different types of input-output specifications, including just instructions (zero-shot prompting), instructions with examples of the task (few-shot prompting), and instructions that ask for an explanation with the answer (chain-of-thought prompting). Except for InstructGPT, which leverages a proprietary collection of data, Flan 2022 is the first work to publicly demonstrate the strong benefits of mixing these prompting settings together during training. Instead of a trade-off between the various settings, mixing prompt settings during training improves all prompt settings at inference time, as shown below for both held-in and held-out tasks from the set of fine-tuning tasks.
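To make the idea of mixed prompt settings concrete, the sketch below formats a single training example in zero-shot, few-shot, or chain-of-thought style, choosing one format at random per example. This is a minimal sketch under illustrative assumptions: the templates and field names are invented and are not the actual Flan templates.

```python
# Sketch of mixing zero-shot, few-shot, and chain-of-thought prompt formats
# during training. The templates below are illustrative, not the Flan ones.
import random

def zero_shot(instruction, question, answer):
    # Just the instruction and the question, no exemplars.
    return (f"{instruction}\n{question}", answer)

def few_shot(instruction, exemplars, question, answer):
    # Prepend a handful of (question, answer) exemplars for the task.
    shots = "\n\n".join(f"{q}\n{a}" for q, a in exemplars)
    return (f"{instruction}\n{shots}\n\n{question}", answer)

def chain_of_thought(instruction, question, rationale, answer):
    # Ask for an explanation and train the target to include the rationale.
    return (f"{instruction} Let's think step by step.\n{question}",
            f"{rationale} So the answer is {answer}.")

def format_example(example, exemplars):
    """Randomly pick one of the three prompt settings for a training example."""
    setting = random.choice(["zs", "fs", "cot"])
    if setting == "zs":
        return zero_shot(example["instruction"], example["question"], example["answer"])
    if setting == "fs":
        return few_shot(example["instruction"], exemplars,
                        example["question"], example["answer"])
    return chain_of_thought(example["instruction"], example["question"],
                            example["rationale"], example["answer"])
```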
Evaluating instruction tuning methods
To understand the overall effects of swapping one instruction tuning collection for another, we fine-tune equivalently sized T5 models on popular public instruction tuning collections, including Flan 2021, T0++, and Super-Natural Instructions. Each model is then evaluated on a set of tasks that are already included in each of the instruction tuning collections, a set of five chain-of-thought tasks, and then a set of 57 diverse tasks from the MMLU benchmark, both with zero-shot and few-shot prompts. In each case, the new Flan 2022 model, Flan-T5, outperforms these prior works, demonstrating a more powerful general-purpose NLP reasoner.
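Public Flan-T5 checkpoints can also be queried off the shelf. As an aside, the snippet below sketches a zero-shot query using the Hugging Face transformers library; the checkpoint name “google/flan-t5-xl” and the prompt are assumptions for illustration, not the evaluation setup used in the paper.

```python
# Zero-shot query against a public Flan-T5 checkpoint via Hugging Face
# transformers. The prompt is an illustrative example, not a benchmark item.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xl"  # assumed 3B-parameter public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

prompt = (
    "Answer the following question.\n"
    "Q: A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```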
Comparing public instruction tuning collections on held-in, chain-of-thought, and held-out evaluation suites, such as BIG-Bench Hard and MMLU. All models except OPT-IML-Max (175B) are trained by us, using T5-XL with 3B parameters. Green text indicates an improvement over the next best comparable T5-XL (3B) model.
Single-task fine-tuning
In applied settings, practitioners usually deploy NLP models fine-tuned specifically for one target task, where training data is already available. We examine this setting to understand how Flan-T5 compares to T5 models as a starting point for applied practitioners. Three settings are compared: fine-tuning T5 directly on the target task, using Flan-T5 without further fine-tuning on the target task, and fine-tuning Flan-T5 on the target task. For both held-in and held-out tasks, fine-tuning Flan-T5 offers an improvement over fine-tuning T5 directly. In some instances, usually where training data is limited for a target task, Flan-T5 without further fine-tuning outperforms T5 with direct fine-tuning.
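The third setting, fine-tuning from a Flan-T5 checkpoint rather than plain T5, amounts to a standard text-to-text fine-tuning loop. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint, toy task data, and hyperparameters are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of single-task fine-tuning starting from a Flan-T5 checkpoint
# instead of plain T5. The task data and hyperparameters are illustrative.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # swap in "t5-base" to compare starting points
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A toy target task: sentiment classification phrased as text-to-text.
train_pairs = [
    ("Is the sentiment positive or negative? The movie was wonderful.", "positive"),
    ("Is the sentiment positive or negative? I would not watch it again.", "negative"),
]

optimizer = AdamW(model.parameters(), lr=1e-4)
model.train()
for epoch in range(3):
    for source, target in train_pairs:
        inputs = tokenizer(source, return_tensors="pt")
        labels = tokenizer(target, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss  # seq2seq cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```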
An additional benefit of using Flan-T5 as a starting point is that training is significantly faster and cheaper: it converges more quickly than fine-tuning T5 directly and usually peaks at higher accuracies. This suggests that less task-specific training data may be necessary to achieve similar or better results on a particular task.
There are significant energy efficiency benefits for the NLP community in adopting instruction-tuned models like Flan-T5 for single-task fine-tuning, rather than conventional non-instruction-tuned models. While pre-training and instruction tuning are financially and computationally expensive, they are a one-time cost, usually amortized over millions of subsequent fine-tuning runs, which can become more costly in aggregate for the most prominent models. Instruction-tuned models offer a promising solution by significantly reducing the number of fine-tuning steps needed to achieve the same or better performance.
Conclusion
The new Flan instruction tuning collection unifies the most popular prior public collections and their methods, while adding new templates and simple improvements like training with mixed prompt settings. The resulting method outperforms Flan, P3, and Super-Natural Instructions on held-in, chain-of-thought, MMLU, and BBH benchmarks by 3–17% across zero-shot and few-shot variants. The results suggest this new collection serves as a more performant starting point for researchers and practitioners interested in both generalizing to new instructions and fine-tuning on a single new task.
Acknowledgements
It was a privilege to work with Jason Wei, Barret Zoph, Le Hou, Hyung Won Chung, Tu Vu, Albert Webson, Denny Zhou, and Quoc V. Le on this project.