Exploring possible use cases for Phi-3-Vision, a small but powerful MLLM that can run locally (with code examples)
Microsoft recently released Phi-3, a powerful family of language models, along with a new vision-language variant called Phi-3-vision-128k-instruct. This 4B-parameter model achieved impressive results on public benchmarks, in some cases even outperforming GPT-4V, and outperforming Gemini 1.0 Pro V on all benchmarks except MMMU.
This blog post will explore how you can use Phi-3-vision-128k-instruct as a robust text and vision model in your data science toolset. We will demonstrate its capabilities through several use cases, including:
- Optical Character Recognition (OCR)
- Image captioning
- Table analysis
- Figure understanding
- Reading comprehension in scanned documents
- Set-of-Mark (SoM) prompting
We'll start by providing a simple code snippet to run this model locally using transformers and bitsandbytes. Then, we'll show an example for each of the use cases listed above.
Running the model locally:
Create a Conda Python environment and install torch and other Python dependencies:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
pip install git+https://github.com/huggingface/transformers.git@60bb571e993b7d73257fb64044726b569fef9403 pillow==10.3.0 chardet==5.2.0 flash_attn==2.5.8 accelerate==0.30.1 bitsandbytes==0.43.1
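Optionally, here is a quick sanity check to confirm that the GPU and the key libraries are visible from Python before running anything heavier:

```python
import torch
import transformers
import bitsandbytes

# Print the installed versions and confirm that a CUDA device is available.
print("torch:", torch.__version__, "| transformers:", transformers.__version__, "| bitsandbytes:", bitsandbytes.__version__)
print("CUDA available:", torch.cuda.is_available())
```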
Then, we can run this script:
# Example inspired by https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
# Import necessary libraries
from PIL import Image
import requests
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
from transformers import BitsAndBytesConfig
import torch
# Define model ID
model_id = "microsoft/Phi-3-vision-128k-instruct"
# Load processor
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
# Define BitsAndBytes configuration for 4-bit quantization
nf4_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model with 4-bit quantization and map to CUDA
model = AutoModelForCausalLM.from_pretrained(
model_id,
device_map="cuda",
trust_remote_code=True,
torch_dtype="auto",
quantization_config=nf4_config,
)
# Define initial chat message with image placeholder
messages = ({"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"})
# Download image from URL
url = "https://images.unsplash.com/photo-1528834342297-fdefb9a5a92b?ixlib=rb-4.0.3&q=85&fm=jpg&crop=entropy&cs=srgb&dl=roonz-nl-vjDbHCjHlEY-unsplash.jpg&w=640"
image = Image.open(requests.get(url, stream=True).raw)
# Prepare prompt with image token
prompt = processor.tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
# Process prompt and image for model input
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
# Generate text response using model
generate_ids = model.generate(
**inputs,
eos_token_id=processor.tokenizer.eos_token_id,
max_new_tokens=500,
do_sample=False,
)
# Remove input tokens from generated response
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
# Decode generated IDs to text
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
# Print the generated response
print(response)
This code loads the Phi-3-Vision model just as we would load any transformers model. We add a bitsandbytes 4-bit quantization configuration so that the model fits into consumer GPU memory.
We use a simple message, `<|image_1|>\nWhat is shown in this image?`, which references the image and asks for a description of its contents. The message is processed together with the image (the same one used as this blog's thumbnail) and passed through the model, producing the following output:
The image shows a single yellow flower with a green stem on a blue background.
Once the model was loaded, processing and prediction took 2 seconds on an RTX 3080.
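Since every use case below only changes the user message and the image URL, it can be handy to wrap the prediction steps in a small helper. Below is a minimal sketch (the `ask_image` name is my own, not part of the model's API), assuming the `model` and `processor` objects from the script above are already loaded:

```python
def ask_image(question: str, url: str, max_new_tokens: int = 500) -> str:
    """Run a single-image question through Phi-3-Vision and return the decoded answer."""
    # Download the image and build the chat prompt with the image placeholder token.
    image = Image.open(requests.get(url, stream=True).raw)
    messages = [{"role": "user", "content": f"<|image_1|>\n{question}"}]
    prompt = processor.tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Process prompt and image, then generate greedily.
    inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
    generate_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        do_sample=False,
    )
    # Strip the input tokens and decode only the newly generated ones.
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
```

For example, the caption query above becomes `print(ask_image("What is shown in this image?", url))`.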
Now that we're all set up, let's explore some use cases:
Optical Character Recognition (OCR)
We want to transcribe the text contained in an image. To do so, replace the message and image URL lines in the code snippet above with:
messages = ({"role": "user", "content": "<|image_1|>\nOCR the text of the image as is. OCR:"})
url = "https://i.postimg.cc/fRFvFVyv/Screenshot-from-2024-05-22-20-55-43.png"
Input:
Output:
3 EXPERIMENTS
We show that Position Interpolation can effectively extend the context window up to 32 times the original size, and that such extension can be done with only several hundred training steps. We show that the resulting models are strong LLMs with fully effective long context windows. We demonstrate its performance on a number of tasks including language modeling, passkey retrieval, and long document summarization. We also present benchmark results of the extended models on the original LLaMA evaluation benchmarks.
The result is as expected! Transcribing the full text took 6 seconds, which is a bit slow compared to conventional OCR.
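One way to measure this latency yourself, assuming the `ask_image` helper sketched earlier, is a simple wall-clock timing around the call:

```python
import time

# Time a single OCR request end to end (image download + generation + decoding).
start = time.perf_counter()
answer = ask_image("OCR the text of the image as is. OCR:", url)
print(f"Generation took {time.perf_counter() - start:.1f} s")
print(answer)
```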
Image captioning
The goal here is to caption a natural image:
messages = ({"role": "user", "content": "<|image_1|>\nWhat is this image of?"})
url = "https://i.postimg.cc/q7Wjtj3n/Screenshot-from-2024-05-22-20-56-56.png"
Input:
Output:
This image shows a stuffed toy that resembles a character from a popular animated television series. The toy has a round body with a hood and wears a pair of glasses. The character's face is not visible in the image.
Mostly correct (minus the glasses). Still, it is striking that it recognized Kenny as a character from an animated series.
Table analysis
We want to see whether the model can parse the table into a machine-readable format:
messages = ({"role": "user", "content": "<|image_1|>\nExtract the table content of the table in JSON format: "})
url = "https://i.postimg.cc/W1YwqXk8/Screenshot-from-2024-05-22-21-41-03.png"
Input:
Output:
```json
[
{
"Benchmark": "Phi-3 Vision-128K-In1",
"MMMU": 40.2,
"MMBench": 80.5,
"ScienceQA": 90.8
},
{
"Benchmark": "LlaVA-1.6 Vicuna-7B",
"MMMU": 34.2,
"MMBench": 76.3,
"ScienceQA": 70.6
},
{
"Benchmark": "QWEN-VI Chat",
"MMMU": 39.0,
"MMBench": 75.8,
"ScienceQA": 67.2
}
]
```
Indeed, it is able to extract the content of the table, in a structured format, directly from the screenshot.
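Since the reply comes back wrapped in a Markdown code fence, a small post-processing step can turn it into Python objects. A minimal sketch, assuming the decoded reply is stored in `response` as in the script above:

```python
import json

# The model wraps its answer in a code fence; drop the fence lines and parse the rest as JSON.
json_text = "\n".join(
    line for line in response.splitlines() if not line.strip().startswith("`")
)
table = json.loads(json_text)
print(table[0]["Benchmark"], table[0]["MMMU"])
```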
Figure understanding
We ask the model to describe a figure:
messages = ({"role": "user", "content": "<|image_1|>\nWhat is this figure about? "})
url = "https://i.postimg.cc/Y0sRQR4Q/Screenshot-from-2024-05-22-21-47-39.png"
Input:
Output:
The figure is a bar chart comparing the performance of several machine learning models, with each model represented by a pair of stacked bars indicating the performance of the base model and the fine-tuned model.
Brief but correct!
Reading comprehension in scanned documents
The goal is to ask questions about scanned or photographed text. This differs from Optical Character Recognition (OCR) in that we are not interested in simply transcribing the text. Instead, we want to evaluate whether the model can understand written content and answer questions about it directly.
messages = ({"role": "user", "content": "<|image_1|>\nWhat is this book about? "})
url = "https://i.postimg.cc/rmHxPbQY/Screenshot-from-2024-05-22-20-56-21.png"
Input:
Output:
The image shows the cover of a book titled 'SAM TALLENT RUNNING THE LIGHT' with a synopsis describing the story of a comedian named Billy Ray Schafer who faces personal and professional challenges. The book appears to be a work of fiction, possibly a novel, as it includes a quote from DOUG STANHOPE, ICONOCLAST, praising the book as brilliant and the best fictional representation of comedy.
The title of the book is slightly off (it merges the author's name with the title), but the MLLM was able to understand what the text in the image is about and summarize it in one pass.
Set-of-Mark (SoM) prompting
Set-of-Mark (SoM) prompting uses interactive segmentation models to divide an image into regions and mark them with symbols, allowing large multimodal models to better understand and answer visually grounded questions.
To keep things simple in this example, I marked the objects manually instead of using a segmentation model, and then referenced mark number 4 in my message:
messages = ({"role": "user", "content": "<|image_1|>\nWhat is object number 4? "})
url = "https://i.postimg.cc/fy0Lz798/scott-webb-p-0l-WFknspg-unsplash-2.jpg"
Input:
Output:
Object number 4 is a cactus with orange flowers in a pot.
The MLLM was able to understand my reference and answer my question accordingly.
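For completeness, here is a minimal sketch of how numbered marks could be drawn on an image with PIL before sending it to the model; the file name and coordinates below are placeholders, not the ones used in this example:

```python
from PIL import Image, ImageDraw

# Open the photo and overlay numbered labels at (hypothetical) object locations.
image = Image.open("plants.jpg")
draw = ImageDraw.Draw(image)
marks = {1: (120, 200), 2: (320, 180), 3: (520, 240), 4: (700, 210)}  # placeholder coordinates
for number, (x, y) in marks.items():
    draw.rectangle([x - 18, y - 18, x + 18, y + 18], fill="white", outline="black")
    draw.text((x - 6, y - 10), str(number), fill="black")
image.save("plants_marked.jpg")
```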
There you have it! Phi-3-Vision is a powerful model for working with images and text, capable of understanding image content, extracting text from images, and even answering questions about what it sees. While its small size of only 4 billion parameters may limit its suitability for tasks requiring strong linguistic skills, most models in its class are at least twice as large (8B parameters or more), which makes it stand out for its efficiency. It shines in applications like document analysis, table-structure understanding, and OCR in the wild. Its compact size makes it well suited for deployment on edge devices or consumer-grade local GPUs, especially after quantization. It will be my go-to model for document analysis and understanding workflows, as its zero-shot capabilities make it a capable tool, especially given its modest size. Next, I'll also work on some LoRA fine-tuning scripts for this model to see how far I can push it on more specialized tasks.