The world of large language models (LLMs) is constantly evolving, and new advances are emerging rapidly. One particularly interesting area is the development of multimodal LLMs (MLLMs), capable of understanding and interacting with both text and images. This opens up a world of possibilities for tasks like document comprehension, visual question answering, and more.
I recently wrote a general post about one of these models, which you can check out here:
But in this one, we will explore a powerful combination: the InternVL model and the QLoRA fine-tuning technique. We will focus on how we can easily customize such models for any specific use case. We will use these tools to create a receipt understanding pipeline that extracts key information such as company name, address, and total purchase amount with high accuracy.
This project aims to develop a system that can accurately extract specific information from scanned receipts, using the capabilities of InternVL. The task presents a unique challenge, requiring not only strong natural language processing (NLP) but also the ability to interpret the visual layout of the input image. This will allow us to create a single, end-to-end, non-OCR pipeline that demonstrates strong generalization across complex documents.
To train and evaluate our model, we will use the SROIE dataset. SROIE provides 1000 scanned receipt images, each annotated with key entities such as:
- Company: The name of the store or business.
- Date: The date of purchase.
- Address: The address of the store.
- Total: The total amount paid.
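For illustration, each annotated receipt can be represented as an image path paired with the JSON string the model should output. This is only a hypothetical sketch (the helper and file path are assumptions, not the project's actual loading code); the field values are taken from the ground-truth sample shown later in this post:

```python
import json
from pathlib import Path


def make_example(image_path: Path, annotation: dict) -> dict:
    """Pair a receipt image with the JSON string the model should produce."""
    target = json.dumps(
        {key: annotation[key] for key in ("company", "date", "address", "total")}
    )
    return {"image": str(image_path), "target": target}


example = make_example(
    Path("data/img/receipt_0001.jpg"),  # illustrative path
    {
        "company": "YONG TAT HARDWARE TRADING",
        "date": "13/03/2018",
        "address": "NO 4,JALAN PERJIRANAN 10, TAMAN AIR BIRU, 81700 PASIR GUDANG, JOHOR.",
        "total": "72.00",
    },
)
```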
We will evaluate the performance of our model using a fuzzy similarity score, a metric that measures the similarity between predicted and actual entities. This metric ranges from 0 (irrelevant results) to 100 (perfect predictions).
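As a minimal sketch of such a metric, a per-field score can be computed with plain string similarity. The project may use a dedicated fuzzy-matching library; the version below relies only on Python's standard difflib:

```python
from difflib import SequenceMatcher


def fuzzy_score(predicted: str, target: str) -> float:
    """Return a 0-100 similarity score between two strings."""
    return 100 * SequenceMatcher(None, predicted.lower(), target.lower()).ratio()


# A near-exact match still earns high partial credit (about 97 here).
print(fuzzy_score("40400 SHAH ALAM", "40400, SHAH ALAM"))
```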
InternVL is a family of multimodal LLMs from OpenGVLab, designed to excel in tasks involving images and text. Its architecture combines a vision model (such as InternViT) with a language model (such as InternLM2 or Phi-3). We will focus on the Mini-InternVL-Chat-2B-V1-5 variant, a smaller version that is well suited to running on consumer GPUs.
Strengths of InternVL:
- Efficiency: Its compact size allows for efficient training and inference.
- Accuracy: Despite being smaller, it achieves competitive performance in several benchmarks.
- Multimodal capabilities: Seamlessly combines image and text understanding.
Demo: You can explore a live demo of InternVL here.
To further improve the performance of our model, we will use QLoRA, a fine-tuning technique that significantly reduces memory consumption while preserving performance. Here is how it works:
- Quantization: The pre-trained LLM is quantized to 4-bit precision, reducing its memory footprint.
- Low-Rank Adapters (LoRA): Instead of modifying all parameters of the pre-trained model, LoRA adds small, trainable adapters to the network. These adapters capture task-specific information without requiring changes to the frozen base model.
- Efficient training: The combination of quantization and LoRA enables efficient tuning even on memory-limited GPUs.
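The training script below relies on a `wrap_lora` helper whose implementation is not shown in this post. Here is a minimal sketch of what such a wrapper could look like using Hugging Face's peft library; the `target_modules` names are placeholders that would need to match the actual projection layers of InternVL's language model:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training


def wrap_lora(model, r=128, lora_alpha=256):
    """Attach trainable low-rank adapters to a 4-bit quantized model (sketch)."""
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(
        r=r,                    # rank of the adapter matrices
        lora_alpha=lora_alpha,  # scaling applied to the adapter outputs
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        # Placeholder module names; adjust to the LLM's attention/MLP projections.
        target_modules=["wqkv", "wo", "w1", "w2", "w3"],
    )
    return get_peft_model(model, lora_config)
```

With this setup, only the adapter weights are trainable while the quantized base model stays frozen.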
Let's dive into the code. First, we will evaluate the baseline performance of Mini-InternVL-Chat-2B-V1-5 without any fine-tuning:
import torch
from transformers import BitsAndBytesConfig

# InternVLChatModel, InternLM2Tokenizer and load_image come from the InternVL
# repository; `args` holds the command-line arguments of the evaluation script.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = InternVLChatModel.from_pretrained(
    args.path,
    device_map={"": 0},
    quantization_config=quant_config if args.quant else None,
    torch_dtype=torch.bfloat16,
)
tokenizer = InternLM2Tokenizer.from_pretrained(args.path)
model.eval()

# Load the receipt image; `max_num` sets the maximum number of tiles.
pixel_values = (
    load_image(image_base_path / "X51005255805.jpg", max_num=6)
    .to(torch.bfloat16)
    .cuda()
)
generation_config = dict(
    num_beams=1,
    max_new_tokens=512,
    do_sample=False,
)

# Single-round, single-image conversation
question = (
    "Extract the company, date, address and total in json format."
    "Respond with a valid JSON only."
)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
The result:
```json
{
"company": "SAM SAM TRADING CO",
"date": "Fri, 29-12-2017",
"address": "67, JLN MENHAW 25/63 TNN SRI HUDA, 40400 SHAH ALAM",
"total": "RM 14.10"
}
```
This code:
- Loads the model from the Hugging Face Hub, with optional 4-bit quantization.
- Loads a sample receipt image and converts it to a tensor.
- Builds a prompt asking the model to extract the relevant fields.
- Runs the model and prints the extracted information in JSON format.
This zero-shot evaluation shows impressive results, achieving an average fuzzy similarity score of 74.24%. This demonstrates InternVL's ability to understand receipts and extract information without any fine-tuning.
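Since the model tends to wrap its answer in a Markdown code fence (as in the result above), the raw response has to be cleaned and parsed before scoring. The helpers below are illustrative, not taken from the repository, but they show how the per-receipt score can be computed:

```python
import json
import re
from difflib import SequenceMatcher

FIELDS = ("company", "date", "address", "total")


def parse_response(response: str) -> dict:
    """Strip an optional Markdown code fence and parse the model's answer."""
    cleaned = re.sub(r"```(?:json)?", "", response).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return {}  # unparsable outputs count as a total miss


def receipt_score(prediction: dict, ground_truth: dict) -> float:
    """Average the per-field fuzzy similarity (0-100) over the four entities."""
    ratios = [
        SequenceMatcher(
            None,
            str(prediction.get(field, "")).lower(),
            str(ground_truth.get(field, "")).lower(),
        ).ratio()
        for field in FIELDS
    ]
    return 100 * sum(ratios) / len(ratios)
```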
To further increase the accuracy, we will fine-tune the model using QLoRA. Here is how we implement it:
import torch
from transformers import BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

# load_data, InternVLChatModel, InternLM2Tokenizer, wrap_lora, SFTDataset,
# CustomDataCollator and IMG_CONTEXT_TOKEN come from the project repository;
# args, path, BASE_PATH, EPOCHS and refined_model are defined elsewhere in the script.
_data = load_data(args.data_path, fold="train")

# Quantization config
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = InternVLChatModel.from_pretrained(
    path,
    device_map={"": 0},
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
tokenizer = InternLM2Tokenizer.from_pretrained(path)

# Tell the model which token id marks the image context in the prompt.
img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
model.img_context_token_id = img_context_token_id
model.config.llm_config.use_cache = False

# Wrap the quantized model with trainable LoRA adapters.
model = wrap_lora(model, r=128, lora_alpha=256)

training_data = SFTDataset(
    data=_data, template=model.config.template, tokenizer=tokenizer
)
collator = CustomDataCollator(pad_token=tokenizer.pad_token_id, ignore_index=-100)

# Re-apply the image context token id after the LoRA wrapping.
model.img_context_token_id = img_context_token_id

train_params = TrainingArguments(
    output_dir=str(BASE_PATH / "results_modified"),
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size = 1 x 16 = 16
    optim="paged_adamw_32bit",
    save_steps=len(training_data) // 10,
    logging_steps=len(training_data) // 50,
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    weight_decay=0.001,
    max_steps=-1,
    group_by_length=False,
    max_grad_norm=1.0,
)

# Trainer
fine_tuning = SFTTrainer(
    model=model,
    train_dataset=training_data,
    dataset_text_field="###",
    tokenizer=tokenizer,
    args=train_params,
    data_collator=collator,
    max_seq_length=tokenizer.model_max_length,
)
fine_tuning.model.print_trainable_parameters()

# Training
fine_tuning.train()

# Save the LoRA adapters
fine_tuning.model.save_pretrained(refined_model)
This code:
- Loads the model with 4-bit quantization enabled.
- Wraps the model with LoRA, adding trainable adapters.
- Builds a training dataset from the SROIE annotations.
- Defines training arguments such as the learning rate, batch size, and number of epochs.
- Initializes a trainer to handle the training process.
- Trains the model on the SROIE dataset.
- Saves the fine-tuned LoRA adapters.
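To run the fine-tuned model afterwards, the saved adapters can be loaded back on top of the quantized base model. Below is a hedged sketch assuming peft's `PeftModel` API, reusing the objects defined in the earlier snippets (whether `.chat` can be called directly through the PEFT wrapper depends on the model implementation):

```python
from peft import PeftModel

# Reload the 4-bit base model exactly as in the training snippet,
# then attach the adapters saved by `save_pretrained`.
base_model = InternVLChatModel.from_pretrained(
    path,
    device_map={"": 0},
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
base_model.img_context_token_id = img_context_token_id
model = PeftModel.from_pretrained(base_model, refined_model)  # adapter directory
model.eval()

response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```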
Here is a sample comparison between the base model and the QLoRA fine-tuned model:
Ground Truth: {
"company": "YONG TAT HARDWARE TRADING",
"date": "13/03/2018",
"address": "NO 4,JALAN PERJIRANAN 10, TAMAN AIR BIRU, 81700 PASIR GUDANG, JOHOR.",
"total": "72.00"
}
Prediction Base: KO (the date, address, and total are wrong)
```json
{
"company": "YONG TAT HARDWARE TRADING",
"date": "13/03/2016",
"address": "JM092487-D",
"total": "67.92"
}
```
Prediction QLoRA: OK (only a minor difference in the address)
{
"company": "YONG TAT HARDWARE TRADING",
"date": "13/03/2018",
"address": "NO 4, JALAN PERUBANAN 10, TAMAN AIR BIRU, 81700 PASIR GUDANG, JOHOR",
"total": "72.00"
}
After fine-tuning with QLoRA, our model achieves a remarkable 95.4% fuzzy similarity score, a significant improvement over the zero-shot baseline (74.24%). This demonstrates the power of QLoRA to increase model accuracy without requiring massive computing resources (a 15-minute training run on 600 samples with an RTX 3080 GPU).
We have successfully created a robust receipt understanding system using InternVL and QLoRA. This approach shows the potential of multimodal LLMs for real-world tasks such as document analysis and information extraction. In this example use case, we gained roughly 21 points of prediction quality (from 74.24% to 95.4%) using a few hundred examples and a few minutes of compute time on a consumer GPU.
You can find the full code implementation for this project here.
The development of multimodal LLMs is just beginning, and the future presents exciting possibilities. The area of automated document processing has immense potential in the era of MLLMs. These models can revolutionize the way we extract information from contracts, invoices, and other documents while requiring minimal training data. By integrating text and vision, they can analyze the layout of complex documents with unprecedented precision, paving the way for more efficient and intelligent information management.
The future of AI is multimodal, and InternVL and QLoRA are powerful tools that help us unlock its potential on a small compute budget.
Links:
Code: https://github.com/CVxTz/doc-llm
Dataset source: https://rrc.cvc.uab.es/?ch=13&com=introduction
Dataset License: Licensed under a Creative Commons Attribution 4.0 International License.