Large Language Models have proved themselves to be a revolutionary technology. Numerous applications exploiting their capabilities have already been developed, and many more are expected to come soon. One of the most interesting applications of Large Language Models is their deployment as intelligent assistants that help human users with a variety of tasks. Chat models trained with instruction tuning and Reinforcement Learning from Human Feedback (RLHF) have shown very promising capabilities of following human instructions and carrying out the assigned tasks. However, they are limited in their applicability to language-only tasks.
Multimodal conversational models aim to unleash the power of Large Language Models on problems that require combining natural language with other modalities. In particular, vision-language models have received increasing attention since the introduction of vision capabilities in GPT-4V. Empowering the natural language capabilities of GPT-4 with image understanding has led to a powerful chat assistant that can help users with tasks requiring both vision and language understanding. While the vision capabilities of GPT-4V are impressive, closed-source models limit the potential for research and experimentation with this technology. Fortunately, open-source models have appeared that bring the power of vision-language models to the community in an easily accessible and transparent way. These models also continue the trend of increasing focus on compute and memory efficiency, a trend already seen for open-source Large Language Models. This is an important feature because it facilitates the widespread adoption of these models.
In this tutorial, I will walk through the process of creating a vision chat assistant using the LLaVA (Large Language and Vision Assistant) model introduced in the Visual Instruction Tuning paper. I will first give a brief introduction to the LLaVA model and its improvements before walking through a simple implementation of a vision chat assistant built on the code provided in the official repository. I will then present some examples I crafted to showcase the capabilities and limitations of the model.
LLaVA
The LLaVA model was introduced in the paper Visual Instruction Tuning, and then further improved in Improved Baselines with Visual Instruction Tuning (also referred to as LLaVA-1.5). The idea behind it is to extract visual embeddings from an image and treat them in the same way as embeddings coming from language tokens by feeding them to a Large Language Model. Intuitively, we can think that the image will be described with “words” that the language model will use to generate its answer. To choose the right “words”, the model uses a pre-trained CLIP visual encoder to extract the visual embeddings and then projects them into the word embedding space of the language model. The latter operation is accomplished with a vision-language connector, which was originally chosen to be a simple linear layer in the first paper Visual Instruction Tuning, and later replaced with a more expressive Multilayer Perceptron (MLP) in Improved Baselines with Visual Instruction Tuning. The architecture of the model is depicted below.
One of the advantages of the method is that by using a pre-trained vision encoder and a pre-trained language model, only the vision-language connector (which is a lightweight module) must be learned from scratch. In particular, the training of LLaVA consists of two stages, illustrated with a short code sketch after the list below:
- Pre-training for feature alignment: both the pre-trained vision encoder and language model are frozen, and only the weights of the vision-language connector are updated. All training samples consist of text-image pairs packed into a single-turn conversation. This stage aims to train the vision-language connector to align the embeddings of the vision encoder with the text embeddings of the language model.
- Fine-tuning with visual instructions: in this stage, only the weights of the vision encoder are frozen, while the vision-language connector and the language model are fine-tuned together. The model is fine-tuned on image-based instruction-following tasks. It is interesting to note that some of this data was created by using language-only GPT-4 to generate instruction-following samples from the captions of the images and the coordinates of the bounding boxes of the depicted entities.
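To make the architecture and the two training stages concrete, here is a minimal, illustrative sketch of the vision-language connector. The two-layer MLP with a GELU activation follows the description above, but the class name, variable names, and dimensions below are my own illustrative choices rather than code taken from the official repository.

import torch
import torch.nn as nn

# Illustrative dimensions: CLIP ViT-L/14 patch features (1024) are projected
# into the word-embedding space of a 7b Vicuna language model (4096).
VISION_DIM, HIDDEN_DIM = 1024, 4096

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP projector (a single linear layer in the first LLaVA paper)."""
    def __init__(self, vision_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, hidden_dim)
        return self.proj(image_features)

connector = VisionLanguageConnector(VISION_DIM, HIDDEN_DIM)
dummy_patches = torch.randn(1, 576, VISION_DIM)  # 576 patches for a 336x336 image with 14x14 patches
visual_tokens = connector(dummy_patches)         # treated like word embeddings by the language model

# Stage 1 (feature alignment): vision encoder and language model frozen, only the connector trained.
# Stage 2 (visual instruction tuning): vision encoder frozen, connector and language model fine-tuned.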
Creating a vision chatbot using the code provided in the official repository is fairly easy. The repository also provides standardized chat templates that can be used to parse the inputs in the right format. Following the same format used during training is essential for the quality of the answers generated by the model. The exact template depends on the language model used. The template for LLaVA-1.5 with a pre-trained Vicuna language model looks like this:
A chat between a curious user and an artificial intelligence assistant. The
assistant gives helpful, detailed, and polite answers to the user's questions. USER: <im_start><image><im_end> User's prompt
ASSISTANT: Assistant answer
USER: Another prompt
The first few lines are the general system prompt used by the model. The special tokens <im_start>, <image>, and <im_end> are used to indicate where embeddings representing the image will be placed.
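In code, the image placeholder is obtained by prepending the corresponding special-token constants from the official repository (llava.constants) to the user's text, as also done in the start_new_chat method shown later:

from llava.constants import (DEFAULT_IM_START_TOKEN, DEFAULT_IMAGE_TOKEN,
                             DEFAULT_IM_END_TOKEN)

user_prompt = "Describe the image in details."
first_turn = (DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN +
              DEFAULT_IM_END_TOKEN + '\n' + user_prompt)
# first_turn == '<im_start><image><im_end>\nDescribe the image in details.'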
The chatbot can be defined in just one simple Python class.
# Imports follow the official LLaVA repository (https://github.com/haotian-liu/LLaVA).
import requests
from io import BytesIO

import torch
from PIL import Image
from transformers import AutoTokenizer, BitsAndBytesConfig

from llava.model import LlavaLlamaForCausalLM
from llava.utils import disable_torch_init
from llava.mm_utils import tokenizer_image_token, KeywordsStoppingCriteria
from llava.constants import (IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN,
                             DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN)
from llava.conversation import conv_templates, SeparatorStyle


class LLaVAChatBot:
    def __init__(self,
                 model_path: str = 'liuhaotian/llava-v1.5-7b',
                 device_map: str = 'auto',
                 load_in_8_bit: bool = True,
                 **quant_kwargs) -> None:
        self.model = None
        self.tokenizer = None
        self.image_processor = None
        self.conv = None
        self.conv_img = None
        self.img_tensor = None
        self.roles = None
        self.stop_key = None
        self.load_models(model_path,
                         device_map=device_map,
                         load_in_8_bit=load_in_8_bit,
                         **quant_kwargs)

    def load_models(self, model_path: str,
                    device_map: str,
                    load_in_8_bit: bool,
                    **quant_kwargs) -> None:
        """Load the model, processor and tokenizer."""
        quant_cfg = BitsAndBytesConfig(**quant_kwargs)
        self.model = LlavaLlamaForCausalLM.from_pretrained(model_path,
                                                           low_cpu_mem_usage=True,
                                                           device_map=device_map,
                                                           load_in_8bit=load_in_8_bit,
                                                           quantization_config=quant_cfg)
        self.tokenizer = AutoTokenizer.from_pretrained(model_path,
                                                       use_fast=False)
        vision_tower = self.model.get_vision_tower()
        vision_tower.load_model()
        vision_tower.to(device='cuda')
        self.image_processor = vision_tower.image_processor
        disable_torch_init()

    def setup_image(self, img_path: str) -> None:
        """Load and process the image."""
        if img_path.startswith('http') or img_path.startswith('https'):
            response = requests.get(img_path)
            self.conv_img = Image.open(BytesIO(response.content)).convert('RGB')
        else:
            self.conv_img = Image.open(img_path).convert('RGB')
        self.img_tensor = self.image_processor.preprocess(self.conv_img,
                                                          return_tensors='pt'
                                                          )['pixel_values'].half().cuda()

    def generate_answer(self, **kwargs) -> str:
        """Generate an answer from the current conversation."""
        raw_prompt = self.conv.get_prompt()
        input_ids = tokenizer_image_token(raw_prompt,
                                          self.tokenizer,
                                          IMAGE_TOKEN_INDEX,
                                          return_tensors='pt').unsqueeze(0).cuda()
        stopping = KeywordsStoppingCriteria([self.stop_key],
                                            self.tokenizer,
                                            input_ids)
        with torch.inference_mode():
            output_ids = self.model.generate(input_ids,
                                             images=self.img_tensor,
                                             stopping_criteria=[stopping],
                                             **kwargs)
        # Decode only the newly generated tokens and store the answer in the conversation.
        outputs = self.tokenizer.decode(
            output_ids[0, input_ids.shape[1]:]
        ).strip()
        self.conv.messages[-1][-1] = outputs
        return outputs.rsplit('</s>', 1)[0]

    def get_conv_text(self) -> str:
        """Return full conversation text."""
        return self.conv.get_prompt()

    def start_new_chat(self,
                       img_path: str,
                       prompt: str,
                       do_sample=True,
                       temperature=0.2,
                       max_new_tokens=1024,
                       use_cache=True,
                       **kwargs) -> str:
        """Start a new chat with a new image."""
        conv_mode = "v1"
        self.setup_image(img_path)
        self.conv = conv_templates[conv_mode].copy()
        self.roles = self.conv.roles
        first_input = (DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN +
                       DEFAULT_IM_END_TOKEN + '\n' + prompt)
        self.conv.append_message(self.roles[0], first_input)
        self.conv.append_message(self.roles[1], None)
        if self.conv.sep_style == SeparatorStyle.TWO:
            self.stop_key = self.conv.sep2
        else:
            self.stop_key = self.conv.sep
        answer = self.generate_answer(do_sample=do_sample,
                                      temperature=temperature,
                                      max_new_tokens=max_new_tokens,
                                      use_cache=use_cache,
                                      **kwargs)
        return answer

    def continue_chat(self,
                      prompt: str,
                      do_sample=True,
                      temperature=0.2,
                      max_new_tokens=1024,
                      use_cache=True,
                      **kwargs) -> str:
        """Continue the existing chat."""
        if self.conv is None:
            raise RuntimeError("No existing conversation found. Start a new"
                               " conversation using the `start_new_chat` method.")
        self.conv.append_message(self.roles[0], prompt)
        self.conv.append_message(self.roles[1], None)
        answer = self.generate_answer(do_sample=do_sample,
                                      temperature=temperature,
                                      max_new_tokens=max_new_tokens,
                                      use_cache=use_cache,
                                      **kwargs)
        return answer
If you are familiar with the transformers library, you will recognize many of the usual features, and the operations performed should be straightforward to understand. Let’s go quickly over the methods of the LLaVAChatBot class defined above.
- load_models: this method loads the language model, the tokenizer, and the image processor, with the specified quantization parameters handled by the BitsAndBytes library. The code mirrors the from_pretrained method used by Hugging Face transformers models. BitsAndBytes allows quantizing the model to 8-bit or 4-bit to reduce GPU memory requirements.
- setup_image: it loads the image from a local path or a URL and converts it to a tensor using the image processor.
- generate_answer: this method returns the model’s answer, continuing the current conversation about the provided image. Again, the generate method of the LLaVA model is analogous to the generate method of Hugging Face transformers models.
- get_conv_text: this method returns the raw text of the conversation so far.
- start_new_chat: this is one of the two main methods of the chatbot and is used to start a new chat with the model. It creates a new conversation given the image and the initial prompt to pass to the model, and it sets up the conversation using the templates defined in the repository, following the format discussed in the previous section.
- continue_chat: the other main method; it continues an existing conversation about an image.
I have provided the full code in this Colab notebook so that you can try it yourself.
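For reference, here is a minimal usage sketch of the LLaVAChatBot class defined above; the image path and the prompts are placeholders, and any extra keyword arguments passed to the constructor are forwarded to BitsAndBytesConfig.

chatbot = LLaVAChatBot(model_path='liuhaotian/llava-v1.5-7b',
                       load_in_8_bit=True)

# Start a conversation about a new image.
answer = chatbot.start_new_chat(img_path='path/to/your/image.jpg',
                                prompt='Describe the image in details.')
print(answer)

# Ask a follow-up question about the same image.
print(chatbot.continue_chat('What is unique about this subject?'))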
In this section, I collect some interesting examples obtained while testing the capabilities of the model. All the examples in this post were created using the model llava-v1.5-7b with 8-bit quantization.
For a start, let's look at the model's ability to describe and understand images. Below, I asked the model to describe an image picturing a white tiger.
A chat between a curious user and an artificial intelligence assistant. The
assistant gives helpful, detailed, and polite answers to the user's questions. USER: <im_start><image><im_end>
Describe the image in details.
ASSISTANT: The image features a white tiger laying down on a rocky surface,
possibly a stone wall or a rock. The tiger appears to be relaxed and
comfortable in its environment. The tiger's position is such that it covers a
significant portion of the image, from the left side to almost the entire right
side. The rocky surface provides a natural and rugged backdrop for the scene.</s></s>
USER: What is unique about this tiger?
ASSISTANT: This tiger is unique because it is a white tiger, which is an
extremely rare and visually striking variation of the species. White tigers are
born with a genetic mutation that causes their fur to be predominantly white,
with the rest of their body being a mix of orange and black stripes. This
distinctive appearance sets them apart from the more common orange and black
striped tigers.</s></s>
The model seems to be able to describe the image in detail and also to reason about the subject of the image, correctly assessing that a white tiger is a rare occurrence.
Next, I tested if the model is able to provide the user with actionable instructions to perform various tasks. This is what distinguishes a “captioning” model from a true vision-chat assistant. For example, can the model provide recipes from the ingredients present in an image? The result is reported below.
A chat between a curious user and an artificial intelligence assistant. The
assistant gives helpful, detailed, and polite answers to the user's questions.
USER: <im_start><image><im_end>
What can I cook with these ingredients?
ASSISTANT: With the ingredients displayed on the table, you can prepare a
delicious roast dinner. The meat, which is a large piece of beef, can be
seasoned and cooked in the oven, while the vegetables, including carrots,
potatoes, and onions, can be roasted alongside the meat. You can also prepare a
side dish with the potatoes, carrots, and onions, which can be served alongside
the roast dinner. The presence of a bowl of sauce suggests that you can use it
as a dipping sauce or to enhance the flavors of the roast and vegetables.
Overall, this combination of ingredients offers a hearty and flavorful meal
that can be enjoyed with family and friends.</s></s>
USER: Could the recipe you suggested be included in a vegetarian diet?
ASSISTANT: The recipe I suggested, which involves roasting a large piece of
beef and vegetables, is not suitable for a vegetarian diet. The beef is the
main protein source in the dish, and it is not suitable for vegetarians who
avoid consuming meat. However, you can still create a vegetarian-friendly meal
using the vegetables and sauce available on the table. For example, you can
prepare a vegetable roast with the carrots, potatoes, and onions, and serve it
alongside a side of mashed potatoes or a vegetable-based sauce. This way, you
can still enjoy the flavors and textures of the dish without including meat.</s></s>
USER: Can you suggest a vegetarian recipe with minimal modifications of the
depicted ingredients?
ASSISTANT: Certainly! Here's a vegetarian recipe that can be made using the
ingredients available on the table:
Roasted Vegetable Medley with Sauce
Ingredients:
* Carrots
* Potatoes
* Onions
* Olive oil
* Salt
* Pepper
* Herbs (such as thyme, rosemary, or parsley)
* Sauce (optional)
Instructions:
1. Preheat your oven to 400°F (200°C).
2. Wash and peel the carrots, potatoes, and onions. Cut them into bite-sized
pieces.
3. In a large bowl, toss the vegetables with olive oil, salt, and pepper.
4. Spread the vegetables evenly on a baking sheet lined with parchment paper.
5. Roast the vegetables in the preheated oven for 20-25 minutes, or until they
are tender and slightly browned.
6. If desired, prepare a sauce by combining ingredients such as tomatoes,
garlic, and herbs. You can use this sauce as a dipping sauce or to enhance the
flavors of the roasted vegetables.
7. Serve the roasted vegetable medley alongside a side of mashed potatoes or a
vegetable-based sauce.
This vegetarian recipe allows you to enjoy the flavors and textures of the
roasted vegetables without including meat. The sauce can be easily customized
to suit your taste preferences, and the dish can be served alongside a variety
of side dishes, such as mashed potatoes or a vegetable-based sauce.</s></s>
The model seems to have a good understanding of the ingredients present in the image, and it is also capable of connecting them to more abstract concepts (e.g., a meal containing meat is not vegetarian).
Let's now see how the model can be used to get advice. This is a useful feature when brainstorming new ideas and can provide real value to the user. In the conversation below, I asked the model for suggestions to change the look of a café.
A chat between a curious user and an artificial intelligence assistant. The
assistant gives helpful, detailed, and polite answers to the user's questions. USER: <im_start><image><im_end>
How can I renovate this café to make it look more vintage?
ASSISTANT: To make the café look more vintage, you can consider the following
changes:
1. Update the lighting: Replace the current lighting with vintage-style lamps
or lanterns, which will create a warm and cozy atmosphere.
2. Add vintage decorations: Incorporate antique items, such as vintage clocks,
artwork, or decorative plates, to create a nostalgic ambiance.
3. Use vintage furniture: Replace the current stools with vintage-style chairs
or barstools, which can add character and charm to the space.
4. Incorporate greenery: Add more greenery, such as potted plants or vines, to
create a lush and inviting atmosphere.
5. Update the bar: Replace the current bar with a vintage-style bar, complete
with a vintage-inspired backsplash and countertop.
6. Use vintage-style glassware: Replace the current glassware with vintage-style
glasses, such as wine glasses, beer mugs, or cocktail shakers.
7. Add vintage-style signage: Replace the current signage with vintage-style
menu boards or chalkboards, which can add a touch of nostalgia to the space.
8. Use vintage-style wallpaper or paint: Apply a vintage-style wallpaper or
paint to the walls, which can create a unique and charming atmosphere.
By implementing these changes, the café will appear more vintage and inviting,
creating a memorable experience for its patrons.</s></s>
The advice is on point, although quite generic, identifying the current elements that could be changed to give the café a more vintage style.
Let’s continue with something practical that can aid the users in their daily activities. Can the model help with writing advertisements or listings?
A chat between a curious user and an artificial intelligence assistant. The
assistant gives helpful, detailed, and polite answers to the user's questions. USER: <im_start><image><im_end>
Write an advertisment for this chair
ASSISTANT: Introducing the modern and stylish black and wood chair, perfect for
any contemporary setting. This chair features a sleek black seat and a wooden
base, creating a striking contrast against any background. Its minimalist
design makes it a versatile addition to any room, whether it's used as a dining
chair, a side table, or a statement piece in a living space. The chair's sturdy
construction ensures it can withstand daily use, while its clean lines and
sleek appearance add a touch of sophistication to any room. Upgrade your living
space with this stylish and functional chair today.</s></s>
USER: Write now an advertisement for the same chair but targeting cafes
ASSISTANT: Attention café owners and design enthusiasts! Introducing the black
and wood chair, the perfect addition to your café's seating area. This stylish
and functional chair offers a modern touch to your establishment, with its
sleek black seat and wooden base creating a striking contrast against any
background. Its minimalist design makes it a versatile choice for any café
setting, whether it's used as a dining chair, a side table, or a statement
piece in a lounge area. The chair's sturdy construction ensures it can
withstand daily use, while its clean lines and sleek appearance add a touch of
sophistication to any room. Upgrade your café's ambiance with this stylish and
functional chair today.</s></s>
The model is able to pinpoint the features of the chair correctly and creates an appealing advertisement, tailoring the answer to different target audiences when asked to.
As a last test, let us see if prompting techniques that have been demonstrated to be beneficial for Large Language Models can be used with LLaVA as well. Below is an example of attempting zero-shot Chain of Thought prompting.
A chat between a curious user and an artificial intelligence assistant. The
assistant gives helpful, detailed, and polite answers to the user's questions. USER: <im_start><image><im_end>
Would the fish depicted in the image float if thrown in water? Think step by
step.
ASSISTANT: The fish depicted in the image is a large white fish sculpture or
statue, which is lying on the ground on a brick walkway near a body of water.
If thrown into the water, the fish would not float because it is a solid,
inanimate object. It would sink to the bottom of the water, as it is not
buoyant due to its composition and size. The fish's position on the ground and
its lack of buoyancy make it impossible for it to float in water.</s></s>
USER: What material could be used to make a fish sculpture that floats on
water?
ASSISTANT: To make a fish sculpture that floats on water, one could use a
lightweight, buoyant material such as expanded polystyrene foam or polyethylene
foam. These materials are commonly used for making buoyant objects like life
jackets, flotation devices, and fishing floats. By shaping the foam into the
desired fish form, it would be able to float on the water's surface,
simulating the behavior of a real fish.</s></s>
The model is able to explain why the fish would not float, correctly identifying that it is not a real fish. It also follows the instruction to think step by step before giving the final answer. For comparison, here is an example of the conversation without the “Think step by step” prompt.
A chat between a curious user and an artificial intelligence assistant. The
assistant gives helpful, detailed, and polite answers to the user's questions. USER: <im_start><image><im_end>
Would the fish depicted in the image float if thrown in water?
ASSISTANT: No, the fish depicted in the image would not float if thrown in
water. It is a large, white, stone fish sculpture, which is not a real fish
and does not have the buoyancy properties of a natural fish.</s></s>
While the final answer is still correct, the explanation given in this case is much less detailed, and the answer is given at the start of the sentence. Since the language model used by LLaVA is causal, the answer is generated before the explanation and thus cannot rely on it. It would be interesting to carry out more extensive experiments to test whether Chain of Thought (CoT) prompting can, in general, improve the performance of vision-language models on complex reasoning tasks, similar to what has been observed for Large Language Models in Chain-of-Thought Prompting Elicits Reasoning in Large Language Models and Large Language Models are Zero-Shot Reasoners.
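With the LLaVAChatBot class defined earlier, such experiments only require appending the zero-shot CoT trigger to the question; the image path below is a placeholder.

cot_answer = chatbot.start_new_chat(
    img_path='path/to/fish_sculpture.jpg',
    prompt='Would the fish depicted in the image float if thrown in water? '
           'Think step by step.')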
While very powerful, LLaVA comes with its limitations. For example, the model has been trained to use only one image per chat, so it isn't able to handle more complex conversations that require interacting with multiple images. It is worth noting, however, that this limitation is not intrinsic to the architecture of the model and could be removed by including multi-image chat data during training.
During my tests, I noticed that the model (at least the 8-bit quantized version of the 1.5 7b model) has trouble counting elements in an image and, more generally, with handling numerical tasks. Below is an example.
A chat between a curious user and an artificial intelligence assistant. The
assistant gives helpful, detailed, and polite answers to the user's questions. USER: <im_start><image><im_end>
How many car are inside the parking lot?
ASSISTANT: There are several cars parked inside the parking lot.</s></s>
USER: List the numbers of the parking spots occupied by cars
ASSISTANT: The parking spots occupied by cars are 66, 67, 68, 69, 70, and 71.</s></s>
The model, like all Large Language Models, is also prone to hallucinations and can be easily tricked into making mistakes with appropriate prompts. For example, when provided with a picture of a desert and asked what species the fish in the image is, it hallucinates a desert fish.
A chat between a curious user and an artificial intelligence assistant. The
assistant gives helpful, detailed, and polite answers to the user's questions. USER: <im_start><image><im_end>
Tell me the species of the fish in this image
ASSISTANT: The fish in this image is a desert fish, which is a species adapted
to living in desert environments.</s></s>
LLaVA shows impressive capabilities in vision-language understanding and marks a clear step forward for open-source multimodal models. One of its biggest advantages is that it is lightweight to train and fine-tune: the full training of LLaVA 1.5 13b required only about 1.2M training samples and roughly one day on a single 8-A100 node. This makes it suitable for fine-tuning on specific domains to obtain an expert assistant, as was done, for example, in LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day.
Adding vision capabilities to chat assistants expands the range of applications of these models, bringing their revolutionary potential to more complex and nuanced tasks. Treating image features as language tokens also makes it possible to apply all the advanced prompting techniques used with text-only language models, and to extend them further. For example, one could expand the power of Retrieval Augmented Generation by retrieving both texts and images relevant to the conversation. In fact, using the shared image-text embedding space of CLIP, it is possible to retrieve both external documents and external images starting from either an input text or an input image!
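As an illustration, here is a minimal sketch of text-to-image retrieval in CLIP's shared embedding space using the Hugging Face transformers library; the checkpoint name is a public CLIP model, while the query and image paths are placeholders.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

query = "a vintage café interior"
images = [Image.open(p).convert("RGB") for p in ["cafe1.jpg", "cafe2.jpg"]]  # placeholder paths

with torch.no_grad():
    text_emb = model.get_text_features(**processor(text=[query], return_tensors="pt", padding=True))
    image_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))

# Cosine similarity in the shared embedding space ranks the candidate images for the text query.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = (text_emb @ image_emb.T).squeeze(0)
best_match = scores.argmax().item()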
Another interesting direction to expand the capabilities of the model is presented in LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing. The main idea is to combine the various capabilities of vision-language chat models, text-to-image generative models, and other vision models (such as image segmentation models) to get an assistant capable of handling multimodal inputs and generating multimodal outputs.
In conclusion, LLaVA marked an important step for open-source multimodal generative models, which have shown impressive capabilities and are attracting a lot of interest. With the more widespread adoption of open-source models, I believe we will soon witness a rapid increase in new applications of these powerful models.
Thank you for reading! If you want to try out the code yourself you can look at this Colab notebook.