Language models are now capable of performing many new natural language processing (NLP) tasks by reading instructions, often ones they have not seen before. The ability to reason about new tasks is mostly credited to training models on a wide variety of tasks phrased as instructions, known as “instruction tuning,” which was introduced by FLAN and extended in T0, Super-Natural Instructions, MetaICL, and InstructGPT. However, much of the data that drives these advances remains unreleased to the broader research community.
In “The Flan Collection: Designing Data and Methods for Effective Instruction Tuning”, we closely examine and release a newer and more extensive publicly available collection of tasks, templates, and methods for instruction tuning to advance the community’s ability to analyze and improve instruction tuning methods. This collection was first used in Flan-T5 and Flan-PaLM, the latter of which achieved significant improvements over PaLM. We show that training a model on this collection yields improved performance over comparable public collections across all tested evaluation benchmarks, e.g., a 3%+ improvement on the 57 tasks in the Massive Multitask Language Understanding (MMLU) evaluation suite and an 8% improvement on BIG-Bench Hard (BBH). Analysis suggests the improvements stem both from the larger and more diverse set of tasks and from applying a set of simple training and data augmentation techniques that are cheap and easy to implement: mixing zero-shot, few-shot, and chain-of-thought training prompts, enriching tasks with input inversion, and balancing task mixtures. Together, these methods enable the resulting language models to reason more competently over arbitrary tasks, even those for which they have seen no fine-tuning examples. We hope that making these findings and resources publicly available will accelerate research into more powerful and general-purpose language models.
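As a rough illustration of one of the simple techniques mentioned above, the sketch below shows examples-proportional task mixture balancing with a per-task example cap. It is a minimal sketch rather than the actual Flan codebase: the task names, sizes, and cap value are all illustrative assumptions.

```python
# Minimal sketch of task-mixture balancing via examples-proportional mixing
# with a per-task cap. Task names, sizes, and the cap are illustrative only,
# not the actual Flan 2022 configuration.

def mixing_weights(task_sizes: dict[str, int], cap: int = 3000) -> dict[str, float]:
    """Weight each task by its (capped) number of examples.

    Capping keeps very large tasks from dominating the mixture while still
    sampling them more often than tiny tasks.
    """
    capped = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(capped.values())
    return {name: size / total for name, size in capped.items()}


if __name__ == "__main__":
    sizes = {"anli_r1": 17_000, "bool_q": 9_400, "copa": 400}  # illustrative sizes
    for task, weight in mixing_weights(sizes).items():
        print(f"{task}: sampled with probability {weight:.3f}")
```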
Public instruction tuning data collections
Since 2020, several instruction tuning task collections have been released in rapid succession, shown in the timeline below. Recent research has yet to coalesce around a unified set of techniques, with different sets of tasks, model sizes, and input formats all represented. This new collection, referred to below as “Flan 2022,” combines prior collections from FLAN, P3/T0, and Natural Instructions with new dialog, program synthesis, and complex reasoning tasks.
A timeline of public instruction tuning collections, including: UnifiedQA, CrossFit, Natural Instructions, Flan 2021, P3, MetaICL, ExT5, Super-Natural Instructions, mT0, Unnatural Instructions, Self-Instruct, and OPT-IML Bench. The table describes the release date, the task collection name, the model name, the base model(s) fine-tuned with the collection, the model size, whether the resulting model is public (green) or not public (red), whether the models are trained with zero-shot (“ZS”), few-shot (“FS”), or chain-of-thought (“CoT”) prompts together (“+”) or separately (“/”), the number of tasks from the collection included in Flan 2022, the total number of examples, and some notable collection-related methods used in these works. Note that the number of tasks and examples are approximations and vary under different assumptions; the counts for each are reported using the task definitions from the respective works.
In addition to scaling to more instructive training tasks, The Flan Collection combines training with different types of input-output specifications, including just instructions (zero-shot prompting), instructions with examples of the task (few-shot prompting), and instructions that ask for an explanation with the answer (chain-of-thought prompting). Except for InstructGPT, which leverages a proprietary collection of data, Flan 2022 is the first work to publicly demonstrate the strong benefits of mixing these prompting settings together during training. Instead of a trade-off between the various settings, mixing prompt settings during training improves all prompt settings at inference time, as shown below for both held-in and held-out tasks from the set of fine-tuning tasks.
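To make the idea of mixed prompt settings concrete, the sketch below formats a single training example in zero-shot, few-shot, or chain-of-thought style, choosing one format at random per example. This is a minimal sketch under illustrative assumptions: the templates and field names are invented and are not the actual Flan templates.

```python
# Sketch of mixing zero-shot, few-shot, and chain-of-thought prompt formats
# during training. The templates below are illustrative, not the Flan ones.
import random

def zero_shot(instruction, question, answer):
    # Just the instruction and the question, no exemplars.
    return (f"{instruction}\n{question}", answer)

def few_shot(instruction, exemplars, question, answer):
    # Prepend a handful of (question, answer) exemplars for the task.
    shots = "\n\n".join(f"{q}\n{a}" for q, a in exemplars)
    return (f"{instruction}\n{shots}\n\n{question}", answer)

def chain_of_thought(instruction, question, rationale, answer):
    # Ask for an explanation and train the target to include the rationale.
    return (f"{instruction} Let's think step by step.\n{question}",
            f"{rationale} So the answer is {answer}.")

def format_example(example, exemplars):
    """Randomly pick one of the three prompt settings for a training example."""
    setting = random.choice(["zs", "fs", "cot"])
    if setting == "zs":
        return zero_shot(example["instruction"], example["question"], example["answer"])
    if setting == "fs":
        return few_shot(example["instruction"], exemplars,
                        example["question"], example["answer"])
    return chain_of_thought(example["instruction"], example["question"],
                            example["rationale"], example["answer"])
```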
Evaluating instruction tuning methods
To understand the overall effects of swapping one instruction tuning collection for another, we fine-tune equivalently sized T5 models on popular public instruction tuning collections, including Flan 2021, T0++, and Super-Natural Instructions. Each model is then evaluated on a set of tasks that are already included in each of the instruction tuning collections, a set of five chain-of-thought tasks, and then a set of 57 diverse tasks from the MMLU benchmark, both with zero-shot and few-shot prompts. In each case, the new Flan 2022 model, Flan-T5, outperforms these prior works, demonstrating a more powerful general-purpose NLP reasoner.
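Public Flan-T5 checkpoints can also be queried off the shelf. As an aside, the snippet below sketches a zero-shot query using the Hugging Face transformers library; the checkpoint name “google/flan-t5-xl” and the prompt are assumptions for illustration, not the evaluation setup used in the paper.

```python
# Zero-shot query against a public Flan-T5 checkpoint via Hugging Face
# transformers. The prompt is an illustrative example, not a benchmark item.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-xl"  # assumed 3B-parameter public checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

prompt = (
    "Answer the following question.\n"
    "Q: A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```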
Comparing public instruction tuning collections on held-in, chain-of-thought, and held-out evaluation suites, such as BIG-Bench Hard and MMLU. All models except OPT-IML-Max (175B) are trained by us, using T5-XL with 3B parameters. Green text indicates an improvement over the next best comparable T5-XL (3B) model.
Single-task fine-tuning
In applied settings, practitioners usually deploy NLP models fine-tuned specifically for one target task, where training data is already available. We examine this setting to understand how Flan-T5 compares to T5 models as a starting point for applied practitioners. Three settings are compared: fine-tuning T5 directly on the target task, using Flan-T5 without further fine-tuning on the target task, and fine-tuning Flan-T5 on the target task. For both held-in and held-out tasks, fine-tuning Flan-T5 offers an improvement over fine-tuning T5 directly. In some instances, usually where training data is limited for a target task, Flan-T5 without further fine-tuning outperforms T5 with direct fine-tuning.
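The third setting, fine-tuning from a Flan-T5 checkpoint rather than plain T5, amounts to a standard text-to-text fine-tuning loop. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint, toy task data, and hyperparameters are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch of single-task fine-tuning starting from a Flan-T5 checkpoint
# instead of plain T5. The task data and hyperparameters are illustrative.
import torch
from torch.optim import AdamW
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"  # swap in "t5-base" to compare starting points
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A toy target task: sentiment classification phrased as text-to-text.
train_pairs = [
    ("Is the sentiment positive or negative? The movie was wonderful.", "positive"),
    ("Is the sentiment positive or negative? I would not watch it again.", "negative"),
]

optimizer = AdamW(model.parameters(), lr=1e-4)
model.train()
for epoch in range(3):
    for source, target in train_pairs:
        inputs = tokenizer(source, return_tensors="pt")
        labels = tokenizer(target, return_tensors="pt").input_ids
        loss = model(**inputs, labels=labels).loss  # seq2seq cross-entropy
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```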
An additional benefit of using Flan-T5 as a starting point is that training is significantly faster and cheaper: it converges more quickly than fine-tuning T5 directly and usually peaks at higher accuracies. This suggests that less task-specific training data may be necessary to achieve similar or better results on a particular task.
There are significant energy efficiency benefits for the NLP community in adopting instruction-tuned models like Flan-T5 for single-task fine-tuning, rather than conventional non-instruction-tuned models. While pre-training and instruction tuning are financially and computationally expensive, they are a one-time cost, usually amortized over millions of subsequent fine-tuning runs, which can become more costly in aggregate for the most prominent models. Instruction-tuned models offer a promising solution by significantly reducing the number of fine-tuning steps needed to achieve the same or better performance.
Conclusion
The new Flan instruction tuning collection unifies the most popular prior public collections and their methods, while adding new templates and simple improvements like training with mixed prompt settings. The resulting method outperforms Flan, P3, and Super-Natural Instructions on held-in, chain-of-thought, MMLU, and BBH benchmarks by 3–17% across zero-shot and few-shot variants. The results suggest this new collection serves as a more performant starting point for researchers and practitioners interested in both generalizing to new instructions and fine-tuning on a single new task.
Acknowledgements
It was a privilege to work with Jason Wei, Barret Zoph, Le Hou, Hyung Won Chung, Tu Vu, Albert Webson, Denny Zhou, and Quoc V. Le on this project.