How multimodality makes LLM alignment more challenging

About a month ago, OpenAI announced that ChatGPT can now see, hear, and speak. This means that the model can help you with more everyday tasks. For example, you can upload a photo of the contents of your refrigerator and ask for meal ideas to prepare with the ingredients you have. Or you can photograph your living room and ask ChatGPT for art and decorating tips.

This is possible because ChatGPT uses multimodal GPT-4 as the underlying model that can accept both images and text inputs. However, new capabilities pose new challenges for model alignment teams that we will discuss in this article.

The term “align LLMs” refers to training the model to behave according to human expectations. This often means understanding human instructions and producing responses that are helpful, accurate, safe, and unbiased. To teach the model the correct behavior, we provide examples that use two steps: supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF).

Supervised fine tuning (SFT) teaches the model to follow specific instructions. In the case of ChatGPT, this means providing examples of conversations. The underlying base model, GPT-4, can't do that yet because it was trained to predict the next word in a sequence, not to answer chatbot-like questions.

While SFT gives ChatGPT its 'chatbot' nature, its responses are still far from perfect. Therefore, reinforcement learning from human feedback (RLHF) is applied to improve the truthfulness, safety, and usefulness of responses. Basically, the instruction-tuned algorithm is asked to produce various responses that are then ranked by humans using the criteria mentioned above. This allows the reward algorithm to learn human preferences and is used to retrain the SFT model.

After this step, a model is aligned with human values, or at least we hope so. But why does multimodality make this process one step more difficult?

When we talk about alignment for multimodal LLMs, we should focus on images and text. It does not cover all of ChatGPT's new capabilities for ¨see, hear and speak¨ because the latter two use speech-to-text and text-to-speech models and are not directly connected to the LLM model.

So this is where things get a little more complicated. Images and text together are more difficult to interpret compared to text input alone. As a result, ChatGPT-4 hallucinates quite frequently about objects and people that it may or may not see in the images.

Gary Marcus wrote an excellent article on multimodal hallucinations that exposes different cases. One of the examples shows ChatGPT reading the time incorrectly from an image. He also had trouble counting chairs in a picture of a kitchen and couldn't recognize a person wearing a clock in a photo.

Picture of https://twitter.com/anh_ng8

Images as input also open a window for adversarial attacks. They can be part of fast injection attacks or used to pass instructions to jailbreak the model and produce harmful content.

Simon Willison documented several image injection attacks in this mail. One of the basic examples involves uploading an image to ChatGPT that contains new instructions that you want it to follow. See example below:

Picture of https://twitter.com/mn_google/status/1709639072858436064

Similarly, the text on the photo could be replaced with instructions for the model to produce hate speech or harmful content.

So why is multimodal data more difficult to align? Multimodal models are still in their early stages of development compared to unimodal language models. OpenAI did not reveal details of how multimodality is achieved in GPT-4 but it is clear that they have provided it with a large number of text-annotated images.

Text-image pairs are harder to obtain than purely textual data, there are fewer such curated data sets, and natural examples are harder to find on the Internet than plain text.

The quality of image-text pairs presents an additional challenge. An image with a text tag of a phrase is not as valuable as an image with a detailed description. To have the latter we often need ai/global-crowd/” rel=”noopener” target=”_blank”>human annotators who follow a set of carefully designed instructions to provide text annotations.

Additionally, training the model to follow instructions requires a sufficient amount of real-world prompts for the user using both images and text. Again, it is difficult to find organic examples due to the novelty of the approach and training examples often have to be created on demand by humans.

The alignment of multimodal models introduces ethical issues that previously did not even need to be considered. Should the model be able to comment on people's appearance, gender and race, or recognize who they are? Should I try to guess the location of the photographs? There are many more aspects to align compared to text-only data.

Multimodality provides new possibilities for how the model can be used, but also presents new challenges for model developers who need to ensure the safety, truthfulness, and usefulness of responses. With multimodality, more aspects need to be aligned, and getting good training data for SFT and RLHF is more challenging. Those looking to build or refine multimodal models must be prepared for these new challenges with development pipelines that incorporate high-quality human feedback.

Magdalena Konkiewicz is a data evangelist at Toloka, a global company supporting rapid and scalable ai development. She has a master's degree in artificial intelligence from the University of Edinburgh and has worked as an NLP engineer, developer and data scientist for companies in Europe and America. She has also been involved in teaching and mentoring data scientists and regularly contributes to publications on data science and machine learning.