Often, we want an LLM to receive an instruction once and then follow it until told otherwise. However, as the following example shows, LLMs can quickly forget instructions after only a few turns of dialogue.
One way to keep the model consistently on track is to add the instruction to every user message. While this works, it comes at the cost of putting more tokens into context, which limits the number of dialogue turns your LLM can have. How do we solve this? By fine-tuning! Ghost Attention is designed to help the LLM follow instructions over more turns of dialogue.
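As a rough sketch of that workaround (the instruction, dialogue, and chat format below are made up for illustration), prepending the instruction to every user turn looks like this:

```python
# Naive workaround: repeat the instruction in every user message.
# The instruction and dialogue are illustrative placeholders.

instruction = "Always answer as a pirate."

dialogue = [
    {"role": "user", "content": "What's the weather like?"},
    {"role": "assistant", "content": "Arr, clear skies ahead, matey!"},
    {"role": "user", "content": "Recommend a book."},
]

# Every repetition of the instruction costs context tokens that could
# otherwise hold additional turns of conversation.
patched = [
    {**turn, "content": f"{instruction}\n\n{turn['content']}"}
    if turn["role"] == "user"
    else turn
    for turn in dialogue
]
```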
Let's start by imagining our dialogues as a data matrix. We have a user message, followed by an assistant message, and the two alternate back and forth. When the last element in our array is a user message, we expect the LLM to generate a message as the assistant.
Importantly, we made sure that the instruction does not appear in any of the user's messages except the first one, since in the real world this is probably the only time a user would enter instructions organically.
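Here is a minimal sketch of that data setup, assuming a simple list-of-dicts chat format (the instruction and messages are invented for illustration):

```python
# Ghost Attention training dialogues: the instruction is concatenated only
# to the first user message; later user turns are left untouched.

instruction = "Act as if you were Oscar Wilde."

dialogue = [
    {"role": "user", "content": "What do you think of modern art?"},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "And what about critics?"},
    {"role": "assistant", "content": "..."},
]

first_turn = dict(dialogue[0], content=f"{instruction} {dialogue[0]['content']}")
gatt_dialogue = [first_turn] + dialogue[1:]
```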
Our setup also includes a model trained with Reinforcement Learning from Human Feedback (RLHF) that we can sample from to see what a good response to the message would look like.
With our dialogues in hand, we perform rejection sampling: we ask the LLM to generate an arbitrary number of different answers and then score them with the RLHF model. We keep the highest-ranked answer for each dialogue and use these higher-quality answers to fine-tune the model.
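A minimal sketch of that rejection-sampling step is below; `generate_fn` and `score_fn` are hypothetical hooks standing in for the chat model and the scoring (reward) model, not a real API:

```python
def rejection_sample(dialogue, generate_fn, score_fn, n_candidates=8):
    # Ask the current model for several candidate replies to the same dialogue.
    candidates = [generate_fn(dialogue) for _ in range(n_candidates)]

    # Score each candidate and keep the best one; these top-scoring
    # replies become the fine-tuning targets.
    return max(candidates, key=lambda reply: score_fn(dialogue, reply))
```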
When we fine-tune on a dialogue and its best sample, we set the loss to zero for all tokens from previous dialogue turns, so only the final response contributes to training. As far as I can tell, this was done because the researchers noticed it improved performance.
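As a sketch of that masking step, assuming the usual causal-LM convention where a label of -100 is ignored by the loss (as in PyTorch's CrossEntropyLoss with its default ignore_index):

```python
def mask_previous_turns(labels, final_turn_start):
    """Zero out the training signal for every token before the final
    assistant turn, so only the last response contributes to the loss."""
    masked = list(labels)
    for i in range(final_turn_start):
        masked[i] = -100  # -100 is skipped by the loss function
    return masked
```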
It's worth noting that while Ghost Attention interacts with the self-attention mechanism used in Transformer models, Ghost Attention is not itself a replacement for self-attention; rather, it is a way to give the self-attention mechanism better data so that it remembers instructions given at the beginning of longer contexts.
The LLaMa 2 paper highlights three specific types of instructions they tested this with: (1) acting like a public figure, (2) speaking in a certain language, and (3) enjoying specific hobbies. Since the set of possible public figures and hobbies is large, they wanted to avoid asking the LLM to take on a hobby or persona that was not present in its training data. To solve this, they asked the LLM to generate the list of hobbies and public figures it would then be directed to act on; the idea being that if the model generated the subject itself, it was more likely to know things about it and therefore less likely to hallucinate. To further improve the data, they made the instructions as concise as possible. The paper does not discuss whether there are limits on the types of instructions that can be given, so presumably it is up to us to test which kinds of instructions work best in models fine-tuned with Ghost Attention.
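A sketch of that instruction-generation idea follows; `generate_fn` is the same kind of hypothetical model hook as above, and the prompts and templates are my own guesses rather than the paper's:

```python
def build_instructions(generate_fn):
    # Ask the model itself for the subjects it will later be told to adopt,
    # so the instructions stay within what the model already knows.
    figures = generate_fn("List 20 well-known public figures, one per line.").splitlines()
    hobbies = generate_fn("List 20 common hobbies, one per line.").splitlines()

    # Keep each instruction as concise as possible.
    return (
        [f"Act as if you were {name}." for name in figures]
        + [f"You are passionate about {hobby}." for hobby in hobbies]
    )
```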
So what are the effects of this new method on the LLM?
In the paper, they include the image above, which shows how the model reacts to instructions not found in its fine-tuning data set. On the left, they test the instruction to “always respond with Haiku,” and on the right they test the instruction to suggest architecture-related activities when possible. While the haiku responses seem to lose a few syllables as they progress, the model clearly tries to maintain the general format in each response. The architecture instruction is especially interesting to me: the model doesn't mention architecture in the first message, where it isn't relevant, but does bring it up later.
Try this yourself in the lmsys.org llama-2 interface. While the results are not quite as clean as the screenshots in the paper suggest, they are still much better than what the LLaMa 1 models produce.
Importantly, we also see that this methodology has an impact on the model's attention. Below is a heatmap of the attention the model pays to each token. The left and bottom sides of the graph show the tokens fed into the model. The top-right portion of the graph is empty because, during generation, tokens beyond the current one are not yet available to the model. As we generate more text, more tokens become available. The heatmap shows higher values with darker colors, so the darker the cell, the more attention is being paid to those tokens. We can see that the 'Act like Oscar Wilde' tokens become progressively darker as we generate more tokens, suggesting that more and more attention is being paid to them.
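If you want to reproduce this kind of plot yourself, here is a rough sketch using Hugging Face Transformers; the model name and prompt are illustrative assumptions, not what the paper used:

```python
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Act like Oscar Wilde. What do you think of modern art?"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# Average the last layer's attention over heads: a (seq_len, seq_len) matrix.
# The upper-right triangle is empty because a causal model cannot attend to
# tokens that have not been generated yet.
attn = outputs.attentions[-1].mean(dim=1)[0]

plt.imshow(attn.numpy(), cmap="Greys")  # darker cells = more attention
plt.xlabel("attended-to token")
plt.ylabel("generating token")
plt.show()
```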
The paper tells us that after more than 20 turns, the context often fills up, causing attention problems. Interestingly, the graph they provide in the appendix also shows that as they continued to fine-tune the model, the score assigned to it by the RLHF model kept dropping. It would be interesting to know whether this was because the instructions grew longer or more complex with each subsequent batch, or whether it was related to a limitation of the data they were using to train the model. If it's the latter, then more training data might allow more batches before the score starts to decrease. Either way, there may be diminishing returns to fine-tuning with Ghost Attention.