Language models have made great strides recently, in part because they can perform tasks with strong performance via in-context learning (ICL), a process in which models are prompted with a few examples of input-label pairs before performing the task on an unseen evaluation example (a minimal prompt sketch follows the list below). In general, a model's success at in-context learning is enabled by:
- Its use of semantic prior knowledge from pre-training to predict labels while following the format of the in-context examples (e.g., seeing examples of movie reviews labeled “positive sentiment” and “negative sentiment” and performing sentiment analysis using prior knowledge).
- Learning the input-label mappings in context from the presented examples (e.g., finding the pattern that positive reviews should be mapped to one label and negative reviews should be mapped to a different label).
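For concreteness, here is a minimal sketch of how such a standard ICL prompt might be assembled for sentiment analysis. The exemplar sentences, label names, and “Input:/Label:” template are illustrative assumptions, not the exact prompt format used in the paper.

```python
# Minimal sketch of a standard ICL prompt for sentiment analysis.
# The exemplars, label names, and "Input:/Label:" template are illustrative
# assumptions rather than the exact format used in the paper.
exemplars = [
    ("I loved every minute of this movie.", "Positive"),
    ("The plot was dull and the acting was worse.", "Negative"),
]
eval_input = "A surprisingly heartfelt and funny film."

prompt = ""
for sentence, label in exemplars:
    prompt += f"Input: {sentence}\nLabel: {label}\n\n"
prompt += f"Input: {eval_input}\nLabel:"  # the model completes the final label

print(prompt)
```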
In “Larger language models do in-context learning differently”, we aim to understand how these two factors (semantic priors and input-label mappings) interact with each other in ICL settings, especially with respect to the scale of the language model being used. We investigate two settings to study these two factors: ICL with flipped labels (flipped-label ICL) and ICL with semantically unrelated labels (SUL-ICL). In flipped-label ICL, the labels of the in-context examples are flipped so that the input-label mappings and the semantic priors disagree with each other. In SUL-ICL, the labels of the in-context examples are replaced with words that are semantically unrelated to the task presented in context. We found that overriding prior knowledge is an emergent ability of model scale, as is the ability to learn in context with semantically unrelated labels. We also found that instruction tuning strengthens the use of prior knowledge more than it increases the ability to learn input-label mappings.
Experiment design
For a diverse mix of datasets, we experimented on seven widely used natural language processing (NLP) tasks: sentiment analysis, subjective/objective classification, question classification, duplicate-question recognition, entailment recognition, financial sentiment analysis, and hate speech detection. We tested five families of language models: PaLM, Flan-PaLM, GPT-3, InstructGPT, and Codex.
Flipped labels
In this experiment, the labels of the in-context examples are flipped, meaning that prior knowledge and the input-label mappings disagree (e.g., sentences containing positive sentiment labeled “negative sentiment”), which allows us to study whether models can override their priors. In this setting, models that can override prior knowledge and learn input-label mappings in context should experience a decrease in performance (since the ground-truth evaluation labels are not flipped).
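A minimal sketch of how flipped-label exemplars could be produced is shown below; the label names, flip procedure, and flip fractions are illustrative assumptions rather than the exact setup from the paper.

```python
import random

# Minimal sketch of the flipped-label setup: a chosen fraction of the
# in-context exemplar labels is inverted, while evaluation labels stay
# untouched. Label names and the procedure are illustrative assumptions.
LABEL_FLIP = {"Positive": "Negative", "Negative": "Positive"}

def flip_labels(exemplars, flip_fraction, seed=0):
    """Return exemplars with `flip_fraction` of their labels inverted."""
    rng = random.Random(seed)
    n_flip = round(flip_fraction * len(exemplars))
    flipped_idx = set(rng.sample(range(len(exemplars)), n_flip))
    return [
        (text, LABEL_FLIP[label] if i in flipped_idx else label)
        for i, (text, label) in enumerate(exemplars)
    ]

exemplars = [
    ("I loved every minute of this movie.", "Positive"),
    ("The plot was dull and the acting was worse.", "Negative"),
    ("A masterpiece from start to finish.", "Positive"),
    ("I want those two hours of my life back.", "Negative"),
]
print(flip_labels(exemplars, flip_fraction=1.0))  # 100% of labels flipped
```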
We found that when no labels are flipped, larger models perform better than smaller models (as expected). But as we flip more and more labels, the performance of small models stays relatively flat, while large models experience large performance drops to well below random guessing (e.g., 90% → 22.5% for code-davinci-002).
These results indicate that large models can override prior knowledge from pre-training when contradicting input-label mappings are presented in context. Small models cannot do this, making this ability an emergent phenomenon of model scale.
Semantically unrelated labels
In this experiment, we replace the labels with semantically unrelated ones (e.g., for sentiment analysis, we use “foo/bar” instead of “negative/positive”), which means the model can only perform ICL by learning the input-label mappings. If a model relies mostly on prior knowledge for ICL, its performance should decrease after this change, since it can no longer use the semantic meanings of the labels to make predictions. A model that can learn input-label mappings in context, on the other hand, would be able to learn these semantically unrelated mappings and should not experience a major drop in performance.
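The sketch below shows how exemplar labels might be swapped out for semantically unrelated stand-ins. The “foo”/“bar” mapping follows the example above; the rest of the code is an illustrative assumption.

```python
# Minimal sketch of the SUL-ICL setup: natural-language labels are replaced
# with semantically unrelated stand-ins, so the model can only succeed by
# learning the input-label mapping from the exemplars themselves.
SUL_MAP = {"Positive": "foo", "Negative": "bar"}

def to_sul(exemplars):
    """Replace each natural-language label with a semantically unrelated one."""
    return [(text, SUL_MAP[label]) for text, label in exemplars]

exemplars = [
    ("I loved every minute of this movie.", "Positive"),
    ("The plot was dull and the acting was worse.", "Negative"),
]
print(to_sul(exemplars))
# [('I loved every minute of this movie.', 'foo'),
#  ('The plot was dull and the acting was worse.', 'bar')]
```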
Indeed, we see that using semantically unrelated labels results in a larger performance drop for small models. This suggests that smaller models rely primarily on their semantic priors for ICL rather than learning from the presented input-label mappings. Large models, on the other hand, have the ability to learn input-label mappings in context when the semantic meaning of the labels is removed.
We also find that including more in-context examples (i.e., exemplars) results in a greater performance improvement for large models than for small models, indicating that large models are better at learning from in-context examples than small models are.
In the SUL-ICL setting, larger models benefit more from additional in-context examples than smaller models do.
Instruction tuning
Instruction tuning is a popular technique for improving model performance, which involves tuning models on a variety of NLP tasks phrased as instructions (e.g., “Question: What is the sentiment of the following sentence, ‘This movie is cool.’? Answer: Positive”). However, since the process uses natural language labels, an open question is whether it improves the ability to learn input-label mappings or whether it instead strengthens the ability to recognize and apply semantic prior knowledge. Both would lead to an improvement in performance on standard ICL tasks, so it is unclear which of the two occurs.
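As a rough illustration, the sketch below formats a labeled sentiment example as an instruction-style training string; the template wording is a hypothetical assumption, not the exact format used to train Flan-PaLM.

```python
# Minimal sketch of formatting a labeled example as an instruction.
# The template wording is an illustrative assumption, not the exact
# Flan-PaLM training format.
def to_instruction(sentence: str, label: str) -> str:
    return (
        "Question: What is the sentiment of the following sentence, "
        f"'{sentence}'?\nAnswer: {label}"
    )

print(to_instruction("This movie is cool.", "Positive"))
```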
We study this question by running the same two settings as before, only this time focusing on comparing standard language models (specifically, PaLM) with their instruction-tuned variants (Flan-PaLM).
First, we find that Flan-PaLM performs better than PaLM when semantically unrelated labels are used. This effect is especially prominent in small models, as Flan-PaLM-8B outperforms PaLM-8B by 9.6% and almost catches up to PaLM-62B. This trend suggests that instruction tuning strengthens the ability to learn input-label mappings, which is not particularly surprising.
Instruction-tuned language models are better at learning input-label mappings than pretraining-only language models.
More interestingly, we saw that Flan-PaLM is actually worse than PaLM at following flipped labels, meaning that the instruction-tuned models were unable to override their prior knowledge (with 100% of labels flipped, Flan-PaLM models do not get below random-guessing accuracy, whereas PaLM models without instruction tuning can reach 31% accuracy in the same setting). These results indicate that instruction tuning must increase the extent to which models rely on semantic priors when they are available.
Instruction-tuned models are worse than pretraining-only models at learning to override semantic priors when presented with flipped labels in context.
Combined with the previous result, we conclude that although instruction tuning improves the ability to learn input-label mappings, it strengthens the use of semantic prior knowledge even more.
Conclusion
We examined the extent to which language models learn in context by using prior knowledge learned during pre-training versus input-label mappings presented in context.
We first showed that large language models can learn to override prior knowledge when presented with enough flipped labels, and that this ability emerges with model scale. We then found that successfully performing ICL with semantically unrelated labels is another emergent ability of model scale. Finally, we analyzed instruction-tuned language models and found that instruction tuning improves the ability to learn input-label mappings, but also further strengthens the use of semantic prior knowledge.
Future work
These results underscore how the ICL behavior of language models can change depending on their scale, and that larger language models have an emergent ability to map inputs to many types of labels, a form of reasoning in which input-label mappings can potentially be learned for arbitrary symbols. Future research could help provide insights into why these phenomena occur with respect to model scale.
Acknowledgements
This work was done by Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. We would like to thank Sewon Min and our fellow contributors at Google Research for their helpful advice and discussions.