The paradigm shift towards bypassing fine-tuning
In our previous article, we reviewed the core concepts of GPT-1, as well as what had inspired it. By combining autoregressive language modeling pre-training with the decoder-only Transformer, GPT-1 revolutionized the field of NLP and made pre-training plus fine-tuning a standard paradigm.
But OpenAI didn't stop there.
Rather, while trying to understand why pre-training Transformer language models is effective, they began to notice the zero-shot behaviors of GPT-1: as pre-training progressed, the model steadily improved its performance on tasks it had never been fine-tuned on, showing that pre-training could improve its zero-shot ability, as shown in the following figure:
This motivated the paradigm shift from “pre-training plus fine-tuning” to “pre-training only”, or in other words, a task-agnostic pre-trained model that can handle different tasks without fine-tuning.
Both GPT-2 and GPT-3 were designed following this philosophy.
But why, you might ask? Isn't the pre-training plus fine-tuning paradigm working well enough? What are the additional benefits of skipping the fine-tuning stage?
Limitations of fine-tuning
Fine-tuning works well for some well-defined tasks, but not all of them, and the problem is that there are numerous tasks in the NLP domain that we have not yet had the chance to experiment with.
For those tasks, the requirement of a fine-tuning stage means that we would need to collect a fine-tuning dataset of significant size for each new task, which is clearly not ideal if we want our models to be truly intelligent one day.
Meanwhile, some works have noted an increasing risk of exploiting spurious correlations in the fine-tuning data as the models we use become larger and larger. This creates a paradox: the model needs to be large so that it can absorb as much information as possible during pre-training, but fine-tuning such a large model on a small, narrowly distributed dataset will make it struggle to generalize to out-of-distribution samples.
Another reason is that, as humans, we do not need large supervised datasets to learn most language tasks, and if we want our models to be useful one day, we would like them to have that fluency and generality as well.
Now, perhaps the real question is: what can we do to achieve that goal while avoiding fine-tuning?
Before we delve into the details of GPT-2 and GPT-3, let's first take a look at the three key elements that influenced their model design: task-agnostic learning, the scaling hypothesis, and in-context learning.
Task-agnostic learning
Task-agnostic learning, also known as meta-learning or learning to learn, refers to a paradigm in machine learning in which the model develops a broad set of skills at training time and then uses these skills at inference time to quickly adapt to a new task.
For example, in MAML (Model-Agnostic Meta-Learning), the authors showed that models could adapt to new tasks with very few examples. More specifically, in each inner loop (highlighted in blue), the model first samples a task from a set of tasks and performs a few gradient descent steps on it, resulting in an adapted model. This adapted model is then evaluated on the same task in the outer loop (highlighted in orange), and the resulting loss is used to update the original model parameters.
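To make the inner/outer loop structure concrete, here is a minimal sketch of a first-order MAML-style update on toy linear-regression tasks. The task distribution, model, and learning rates are invented for illustration, and the second-order term of the original MAML update is dropped for simplicity, so this is an approximation of the idea rather than the original algorithm:

```python
# A minimal first-order MAML-style sketch (FOMAML) on toy linear-regression
# tasks, written from scratch in NumPy. Illustrative only: the task
# distribution, model, and hyperparameters are made up for this example.
import numpy as np

rng = np.random.default_rng(0)

def sample_task():
    """Each 'task' is a different linear function y = a*x + b."""
    a, b = rng.uniform(-2, 2, size=2)
    def make_batch(n):
        x = rng.uniform(-5, 5, size=n)
        return x, a * x + b
    return make_batch

def predict(params, x):
    w, b = params
    return w * x + b

def mse_grad(params, x, y):
    """Gradient of mean-squared error w.r.t. (w, b) for the linear model."""
    err = predict(params, x) - y
    return np.array([np.mean(2 * err * x), np.mean(2 * err)])

meta_params = np.zeros(2)          # shared initialization learned across tasks
inner_lr, outer_lr = 0.01, 0.001
inner_steps = 5

for meta_iter in range(2000):
    task = sample_task()
    x_support, y_support = task(10)    # support set for the inner loop
    x_query, y_query = task(10)        # query set for the outer-loop update

    # Inner loop: adapt a copy of the shared parameters to the sampled task.
    adapted = meta_params.copy()
    for _ in range(inner_steps):
        adapted -= inner_lr * mse_grad(adapted, x_support, y_support)

    # Outer loop (first-order approximation): update the shared initialization
    # using the query-set gradient evaluated at the adapted parameters.
    meta_params -= outer_lr * mse_grad(adapted, x_query, y_query)

print("learned initialization (w, b):", meta_params)
```

The key point is the division of labor: the inner loop adapts to one task with a handful of gradient steps, while the outer loop slowly learns an initialization that makes such quick adaptation possible across many tasks.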
MAML shows that learning can be more general and more flexible, which aligns with the direction of avoiding fine-tuning on each individual task. In the figure below, the authors of GPT-3 explained how this idea can be extended to language model learning when combined with in-context learning, with the outer loop iterating through different tasks, while the inner loop is described using in-context learning, which will be explained in more detail in later sections.
The scaling hypothesis
As perhaps the most influential idea behind the development of GPT-2 and GPT-3, the scaling hypothesis refers to the observation that, by training on larger data, larger models could somehow develop new capabilities automatically without explicit supervision, or in other words, emergent abilities could arise simply by scaling up, as we saw in the zero-shot abilities of the pre-trained GPT-1.
Both GPT-2 and GPT-3 can be considered experiments to test this hypothesis, with GPT-2 designed to test whether a larger model pre-trained on a larger dataset could be used directly to solve downstream tasks, and GPT-3 designed to test whether in-context learning could bring improvements over GPT-2 when scaled up even further.
We will discuss more details about how they implemented this idea in later sections.
In-context learning
As shown in Figure 3, in the context of language models, in-context learning refers to the inner loop of the meta-learning process, where the model receives a natural language instruction and a few demonstrations of the task at inference time, and is then expected to complete that task by automatically discovering the patterns in the given demonstrations.
Note that in-context learning happens at test time with no gradient updates performed, which is completely different from traditional fine-tuning and much closer to how humans perform new tasks.
In case you are not familiar with the terminology, demonstrations usually refer to example input-output pairs associated with a particular task, as shown in the “examples” part of the following figure:
The idea of in-context learning was explored implicitly in GPT-2 and then more formally in GPT-3, where the authors defined three different settings: zero-shot, one-shot, and few-shot, depending on how many demonstrations are given to the model.
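To make the three settings concrete, here is a small sketch of how prompts with different numbers of demonstrations might be assembled. The translation task and prompt format are illustrative assumptions, and `query_model` is a hypothetical stand-in for a call to a pre-trained language model, not a real API:

```python
# A rough sketch of how zero-shot, one-shot, and few-shot prompts can be
# assembled for in-context learning. The task, demonstrations, and prompt
# format are invented for illustration.
def build_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, optional demonstrations, and the query.

    No model weights are updated: the demonstrations only appear in the
    model's input context.
    """
    lines = [instruction, ""]
    for source, target in demonstrations:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

instruction = "Translate English to French."
demos = [
    ("sea otter", "loutre de mer"),
    ("cheese", "fromage"),
]

zero_shot = build_prompt(instruction, [], "peppermint")         # 0 demonstrations
one_shot  = build_prompt(instruction, demos[:1], "peppermint")  # 1 demonstration
few_shot  = build_prompt(instruction, demos, "peppermint")      # K demonstrations

print(few_shot)
# The model is then asked to continue the prompt, e.g.:
# completion = query_model(few_shot)   # hypothetical model call
```

The only thing that changes across the three settings is how many demonstrations are packed into the context; the model parameters stay frozen in all of them.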
In short, task-agnostic learning highlights the potential of avoiding fine-tuning, while the scaling hypothesis and in-context learning suggest a practical path to achieve it.
In the following sections, we will discuss more details for GPT-2 and GPT-3, respectively.