Leveraging Smaller LLMs for Retrieval-Augmented Generation (RAG): Llama-3.2-1B-Instruct and LanceDB
Abstract: Retrieval-Augmented Generation (RAG) combines large language models with external knowledge sources to produce more accurate and contextually relevant responses. This article explores how smaller language models (LLMs), such as Meta's recently open-sourced Llama 3.2 1B model, can be used effectively to summarize and index large documents, thereby improving the efficiency and scalability of RAG systems. We provide a step-by-step guide, complete with code snippets, on how to summarize text fragments from a product documentation PDF and store them in a LanceDB database for efficient retrieval.
Introduction
Retrieval-Augmented Generation is a paradigm that improves the capabilities of language models by integrating them with external knowledge bases. While large LLMs such as GPT-4 have demonstrated notable capabilities, they entail significant computational costs. Small LLMs offer a more resource-efficient alternative, especially for tasks such as text summarization and keyword extraction, which are crucial for indexing and retrieval in RAG systems.
In this article, we will demonstrate how to use a small LLM to:
- Extract and summarize text from a PDF document.
- Generate embeddings for summaries and keywords.
- Store data efficiently in a LanceDB database.
- Use this data for effective retrieval in a RAG pipeline.
- Apply an agentic workflow in which the LLM corrects its own errors.
Using a smaller LLM dramatically reduces the cost of these conversions on large data sets, achieves benefits similar to those of larger-parameter LLMs for simpler tasks, and can be easily hosted on-premises or in the cloud at minimal cost.
We will use the Llama 3.2 1 billion parameter model (https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), the smallest state-of-the-art LLM yet.
The problem with embedding plain text
Before delving into the implementation, it is essential to understand why embedding plain text from documents can be problematic in RAG systems.
Ineffective context capture
Embedding plain text from a page without summarization often produces embeddings that suffer from:
- High-dimensional noise: plain text may contain irrelevant information, formatting artifacts, or repetitive language that does not contribute to the understanding of the main content.
- Diluted key concepts: important concepts can be buried within extraneous text, making the embeddings less representative of the critical information.
Retrieval inefficiency
When embeddings do not accurately represent key concepts in the text, the retrieval system may fail to:
- Answer user queries effectively: Embeddings may not align well with query embeddings, resulting in poor retrieval of relevant documents.
- Provide the correct context: Even if a document is retrieved, it may not provide the precise information the user is looking for due to noise in the embedding.
Solution: summarize before embedding
Summarizing the text before generating embeddings addresses these problems as follows:
- Distillation of key information: the summary extracts the essential points and keywords, eliminating unnecessary details.
- Improved embedding quality: embeddings generated from summaries are more focused and representative of the main content, improving retrieval accuracy.
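To make this concrete, here is a small illustrative comparison (not from the original article; the strings are invented, and it uses the same all-MiniLM-L6-v2 model as later in the pipeline):
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer('all-MiniLM-L6-v2')
query = "How do I reset the device?"
# A raw page typically mixes the useful sentence with boilerplate
raw_page = ("Page 42 of 300. Copyright 2024 Example Corp. To reset the device, hold "
            "the power button for 10 seconds. See also chapter 7. All trademarks ...")
summary = "Explains how to reset the device by holding the power button for 10 seconds."

q_vec, raw_vec, sum_vec = embedder.encode([query, raw_page, summary])
print("query vs raw page:", util.cos_sim(q_vec, raw_vec).item())
print("query vs summary:", util.cos_sim(q_vec, sum_vec).item())
The summary embedding typically scores noticeably higher against the query because the boilerplate has been stripped away.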
Prerequisites
Before you begin, make sure you have the following installed:
- Python 3.7 or higher
- PyTorch
- Hugging Face Transformers
- Sentence Transformers
- PyMuPDF (for PDF processing)
- LanceDB
- A GPU with at least 6 GB of VRAM (a laptop GPU or a Colab T4 will be enough) or similar
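If you are starting from a fresh environment, the dependencies can be installed with pip (package names assumed from the imports used below):
pip install torch transformers sentence-transformers pymupdf lancedb pandas pyarrow numpy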
Step 1: Set up the environment
First, import all necessary libraries and configure logging for debugging and tracing.
import pandas as pd
import fitz # PyMuPDF
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import lancedb
from sentence_transformers import SentenceTransformer
import json
import pyarrow as pa
import numpy as np
import re
import logging

# Configure logging for debugging and tracing
logging.basicConfig(level=logging.INFO)
log = logging.getLogger(__name__)
Step 2: Define auxiliary functions
Creating the prompt
We define a function to create prompts that follow the Llama 3.2 prompt format.
def create_prompt(question, system_message="You are a helpful assistant for summarizing text and returning the result in JSON format"):
    """
    Create a prompt as per the Llama 3.2 prompt format.
    """
    # The unusual header "assistant1231231222" acts as a sentinel so the
    # assistant's reply can be split out of the decoded text reliably.
    prompt_template = f'''
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
{system_message}<|eot_id|><|start_header_id|>user<|end_header_id|>
{question}<|eot_id|><|start_header_id|>assistant1231231222<|end_header_id|>
'''
    return prompt_template
Processing the prompt
This function processes the prompt with the model and tokenizer. We set the temperature to 0.1 to make the model less creative (less prone to hallucination).
def process_prompt(prompt, model, tokenizer, device, max_length=500, temperature=0.1):
    """
    Processes a prompt, generates a response, and extracts the assistant's reply.
    """
    prompt_encoded = tokenizer(prompt, truncation=True, padding=False, return_tensors="pt")
    model.eval()
    output = model.generate(
        input_ids=prompt_encoded.input_ids.to(device),
        max_new_tokens=max_length,
        attention_mask=prompt_encoded.attention_mask.to(device),
        do_sample=True,  # sampling must be enabled for temperature to take effect
        temperature=temperature,  # low temperature -> more deterministic output
    )
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    # Split on the sentinel header to isolate the assistant's reply
    parts = answer.split("assistant1231231222", 1)
    if len(parts) > 1:
        words_after_assistant = parts[1].strip()
        return words_after_assistant
    else:
        print("The assistant's response was not found.")
        return "NONE"
Step 3: Load the model
We use the Llama 3.2 1B Instruct model for summarization. We load the model in bfloat16 to reduce memory and run it on an NVIDIA laptop GPU (NVIDIA GeForce RTX 3060 6 GB, NVIDIA-SMI driver 555.58.02, CUDA Toolkit 12.5, V12.5.40) on a Linux system.
For production serving, it would be better to host the model via vLLM or, better yet, ExLlamaV2.
model_name_long = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name_long)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
log.info(f"Loading the model {model_name_long}")
bf16 = False
fp16 = True
if torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        log.info("Your GPU supports bfloat16: accelerating inference with bf16=True")
        bf16 = True
        fp16 = False
# Load the model
device_map = {"": 0}  # Load on GPU 0
torch_dtype = torch.bfloat16 if bf16 else torch.float16
model = AutoModelForCausalLM.from_pretrained(
    model_name_long,
    torch_dtype=torch_dtype,
    device_map=device_map,
)
log.info(f"Model loaded with torch_dtype={torch_dtype}")
Step 4: Read and process the PDF document
We extract text from each page of the PDF document.
file_path = './data/troubleshooting.pdf'
dict_pages = {}
# Open the PDF file
with fitz.open(file_path) as pdf_document:
    for page_number in range(pdf_document.page_count):
        page = pdf_document.load_page(page_number)
        page_text = page.get_text()
        dict_pages[page_number] = page_text
        print(f"Processed PDF page {page_number + 1}")
Step 5: Configure LanceDB and SentenceTransformer
We initialize the SentenceTransformer model to generate embeddings and configure LanceDB to store the data, using a PyArrow-based schema for the LanceDB table.
Note that the keywords are not used yet, but they can support hybrid search, i.e., combining vector similarity search with text search, if required (see the sketch after the following code block).
# Initialize the SentenceTransformer model
sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
# Connect to LanceDB
db = lancedb.connect('./data/my_lancedb')
# Define the schema using PyArrow
schema = pa.schema([
    pa.field("page_number", pa.int64()),
    pa.field("original_content", pa.string()),
    pa.field("summary", pa.string()),
    pa.field("keywords", pa.string()),
    pa.field("vectorS", pa.list_(pa.float32(), 384)),  # embedding size of 384
    pa.field("vectorK", pa.list_(pa.float32(), 384)),
])
# Create or connect to a table
table = db.create_table('summaries', schema=schema, mode='overwrite')
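As a hedged sketch of how the keyword vectors could later be combined with the summary vectors (this helper is illustrative, not from the original article; it assumes LanceDB's vector_column_name query option and the _distance field that searches return):
def hybrid_search(query, sentence_model, top_k=3):
    # Embed the query once and search both vector columns
    q = sentence_model.encode(query).tolist()
    hits_s = table.search(q, vector_column_name="vectorS").metric("cosine").limit(top_k).to_list()
    hits_k = table.search(q, vector_column_name="vectorK").metric("cosine").limit(top_k).to_list()
    # Deduplicate by page, keeping the best (smallest) cosine distance per page
    merged = {}
    for hit in sorted(hits_s + hits_k, key=lambda h: h["_distance"]):
        merged.setdefault(hit["page_number"], hit)
    return list(merged.values())[:top_k]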
Step 6: Summarize and store data
We loop through each page, generate a summary and keywords, and store them along with their embeddings in the database.
# Loop through each page in the PDF
for page_number, text in dict_pages.items():
    question = f"""For the given passage, provide a long summary about it, incorporating all the main keywords in the passage.
Format should be in JSON format like below:
{{
  "summary": <text summary>,
  "keywords": <a comma-separated list of main keywords and acronyms that appear in the passage>,
}}
Make sure that JSON fields have double quotes and use the correct closing delimiters.
Passage: {text}"""
    prompt = create_prompt(question)
    response = process_prompt(prompt, model, tokenizer, device)
    # Error handling for JSON decoding
    try:
        summary_json = json.loads(response)
    except json.decoder.JSONDecodeError as e:
        exception_msg = str(e)
        question = f"""Correct the following JSON {response} which has {exception_msg} to proper JSON format. Output only JSON."""
        log.warning(f"{exception_msg} for {response}")
        prompt = create_prompt(question)
        response = process_prompt(prompt, model, tokenizer, device)
        log.warning(f"Corrected '{response}'")
        try:
            summary_json = json.loads(response)
        except Exception as e:
            log.error(f"Failed to parse JSON: '{e}' for '{response}'")
            continue
    keywords = ', '.join(summary_json['keywords'])
    # Generate embeddings
    vectorS = sentence_model.encode(summary_json['summary'])
    vectorK = sentence_model.encode(keywords)
    # Store the data in LanceDB
    table.add([{
        "page_number": int(page_number),
        "original_content": text,
        "summary": summary_json['summary'],
        "keywords": keywords,
        "vectorS": vectorS,
        "vectorK": vectorK,
    }])
    print(f"Data for page {page_number} stored successfully.")
Using the LLM to correct its own output
When generating summaries and extracting keywords, LLMs can sometimes produce results that are not in the expected format, such as malformed JSON.
We can leverage the LLM itself to fix these results by asking it to correct the errors. The correction step from the loop above is shown again below with a more detailed prompt.
# Use the small Llama 3.2 1B model to create the summary
for page_number, text in dict_pages.items():
    question = f"""For the given passage, provide a long summary about it, incorporating all the main keywords in the passage.
Format should be in JSON format like below:
{{
  "summary": <text summary> example "Some Summary text",
  "keywords": <a comma separated list of main keywords and acronyms that appear in the passage> example ["keyword1", "keyword2"],
}}
Make sure that JSON fields have double quotes, e.g., instead of 'summary' use "summary", and use the closing and ending delimiters.
Passage: {text}"""
    prompt = create_prompt(question)
    response = process_prompt(prompt, model, tokenizer, device)
    try:
        summary_json = json.loads(response)
    except json.decoder.JSONDecodeError as e:
        exception_msg = str(e)
        # Use the LLM to correct its own output
        question = f"""Correct the following JSON {response} which has {exception_msg} to proper JSON format. Output only the corrected JSON.
Format should be in JSON format like below:
{{
  "summary": <text summary> example "Some Summary text",
  "keywords": <a comma separated list of keywords and acronyms that appear in the passage> example ["keyword1", "keyword2"],
}}"""
        log.warning(f"{exception_msg} for {response}")
        prompt = create_prompt(question)
        response = process_prompt(prompt, model, tokenizer, device)
        log.warning(f"Corrected '{response}'")
        # Try parsing the corrected JSON
        try:
            summary_json = json.loads(response)
        except json.decoder.JSONDecodeError as e:
            log.error(f"Failed to parse corrected JSON: '{e}' for '{response}'")
            continue
In this code snippet, if the initial output of the LLM cannot be parsed as JSON, we ask the LLM again to correct the JSON. This self-correcting pattern improves the robustness of our pipeline.
Suppose the LLM generates the following malformed JSON:
{
  'summary': 'This page explains the installation steps for the product.',
  'keywords': ['installation', 'setup', 'product']
}
Trying to parse this JSON fails due to using single quotes instead of double quotes. We detect this error and ask the LLM to correct it:
exception_msg = "Expecting property name enclosed in double quotes"
question = f"""Correct the following JSON {response} which has {exception_msg} to proper JSON format. Output only the corrected JSON."""
The LLM then provides the corrected JSON:
{
  "summary": "This page explains the installation steps for the product.",
  "keywords": ["installation", "setup", "product"]
}
By using the LLM to correct its own output, we ensure that the data is in the correct format for further processing.
Extending self-correction through LLM agents
This pattern of using the LLM to correct its own results can be extended and automated through the use of LLM agents. LLM agents can:
- Automate error handling: detect errors and autonomously decide how to correct them without explicit instructions.
- Improve efficiency: Reduce the need for manual intervention or additional code for error correction.
- Improve robustness: Continually learn from mistakes to improve future results.
LLM Agents act as intermediaries that manage the flow of information and handle exceptions intelligently. They can be designed to:
- Analyze results and validate formats.
- Re-query the LLM with refined instructions when errors are found.
- Record errors and corrections for future reference and model adjustment.
Approximate implementation:
Instead of manually catching exceptions and re-requesting them, an LLM agent could encapsulate this logic:
def generate_summary_with_agent(text):
    agent = LLMAgent(model, tokenizer, device)
    question = f"""For the given passage: {text}, provide a summary and keywords in proper JSON format."""
    prompt = create_prompt(question)
    response = agent.process_and_correct(prompt)
    return response
The LLMAgent class would handle initial processing, error detection, replay, and correction internally.
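A minimal sketch of such an agent, reusing the create_prompt and process_prompt helpers defined earlier (the class name and its retry logic are illustrative assumptions, not from the original article):
class LLMAgent:
    def __init__(self, model, tokenizer, device, max_retries=2):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.max_retries = max_retries

    def process_and_correct(self, prompt):
        # Generate, validate as JSON, and re-prompt with the error on failure
        response = process_prompt(prompt, self.model, self.tokenizer, self.device)
        for _ in range(self.max_retries):
            try:
                return json.loads(response)
            except json.JSONDecodeError as e:
                correction = create_prompt(
                    f"Correct the following JSON {response} which has {e} "
                    "to proper JSON format. Output only the corrected JSON."
                )
                response = process_prompt(correction, self.model, self.tokenizer, self.device)
        return json.loads(response)  # raises if still invalid after all retries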
Now let's see how we can use these embeddings for an effective RAG pattern, again using the LLM to aid in ranking.
Retrieval and Generation: User Query Processing
This is the usual flow. We take the user's question and look for the most relevant summaries.
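The search_summary helper is not defined in the snippets above; a minimal sketch consistent with the logs shown later, assuming LanceDB's vector_column_name query option and the table created in Step 5:
def search_summary(user_question, sentence_model, top_k=3):
    # Embed the question with the same model used during indexing
    query_vector = sentence_model.encode(user_question).tolist()
    # Cosine similarity search over the summary embeddings, top 3 results
    return (
        table.search(query_vector, vector_column_name="vectorS")
        .metric("cosine")
        .limit(top_k)
        .to_list()
    )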
# Example usage
user_question = "Not able to manage new devices"
results = search_summary(user_question, sentence_model)
Preparing the retrieved summaries
We compile the retrieved summaries into a list, associating each summary with its page number for reference.
summary_list = []
for idx, result in enumerate(results):
    summary_list.append(f"{result['page_number']}# {result['summary']}")
Ranking the summaries
We ask the language model to rank the retrieved summaries by their relevance to the user's question and to select the most relevant one. The LLM is used here to rank the summaries for contextual relevance, rather than relying on K-nearest-neighbour, cosine distance, or other vector-matching scores alone.
question = f"""From the given list of summaries {summary_list}, rank which summary would possibly have \
the answer to the question '{user_question}'. Return only that summary from the list."""
log.info(question)
Extract the selected summary and generate the final response
We parse the page number from the selected summary, retrieve the original content associated with it, and ask the language model to generate a detailed answer to the user's question using this context.
# Parse the page number prefix (e.g., "113#") from the selected summary
# (assumes the model echoes the 'page#' prefix, as in the sample logs below)
page_number = int(re.search(r'(\d+)#', response).group(1))
log.info(f"Page number: {page_number}")
for idx, result in enumerate(results):
    if int(page_number) == result['page_number']:
        page = result['original_content']
        question = f"""Can you answer the query: '{user_question}' \
using the context below?
Context: '{page}'
"""
        log.info(question)
        prompt = create_prompt(
            question,
            "You are a helpful assistant that will go through the given query and context, think in steps, and then try to answer the query \
with the information in the context."
        )
        response = process_prompt(prompt, model, tokenizer, device, temperature=0.01)  # less freedom to hallucinate
        log.info(response)
        print("Final Answer:")
        print(response)
        break
Workflow explained
1. User query vectorization: the user question is converted to an embedding using the same SentenceTransformer model used during indexing.
2. Similarity search: the query embedding is used to search the vector database (LanceDB) for the most similar summaries, returning the top 3.
>> From the VectorDB cosine search, the top 3 nearest-neighbour results,
prefixed with their page numbers:
07:04:00 INFO:From the given list of summary [
'112# Cannot place newly discovered device in managed state',
'113# The passage discusses the troubleshooting steps for managing newly discovered devices on the NSF platform, specifically addressing issues with device placement, configuration, and deployment.',
'116# Troubleshooting Device Configuration Backup Issue'] rank which summary would possibly have the possible answer to the question Not able to manage new devices. Return only that summary from the list
3. Summary ranking: the retrieved summaries are passed to the language model, which ranks them based on their relevance to the user's question.
>> Asking the LLM to select from the top N based on context
07:04:01 INFO:Selected Summary '113# The passage discusses the troubleshooting steps for managing newly discovered devices on the NSF (Network Systems and Functional Requirements) platform, specifically addressing issues with device placement, configuration, and deployment.'
4. Context retrieval: the original content associated with the most relevant summary is retrieved by parsing the page number and fetching the corresponding page from LanceDB.
07:04:01 INFO:Page number: 113
07:04:01 INFO:Can you answer the question or query or provide more deatils query:'Not able to manage new devices' Using the context below
context:'3
Check that the server and client platforms are appropriately sized. ...
Failed SNMP communication between the server and managed device.
SNMP traps from managed devices are arriving at one server,
or no SNMP traps are ....
'
5. Response generation: The language model generates a detailed answer to the user's question using the retrieved context.
Here is a sample output from a sample PDF I have used.
07:04:08 INFO:I'll go through the steps and provide more details to answer the query.
The query is: "Not able to manage new devices"
Here's my step-by-step analysis:
**Step 1: Check that the server and client platforms are appropriately sized**
The context mentions that the NSP Planning Guide is available, which implies that the NSP (Network Service Provider) has a planning process to ensure that the server and client platforms are sized correctly. This suggests that the NSP has a process in place to evaluate the performance and capacity of the server and client platforms to determine if they are suitable for managing new devices.
**Step 2: Check for resynchronization problems between the managed network and the NFM-P**
The context also mentions that resynchronization problems between the managed network and the NFM-P can cause issues with managing new devices. This implies that there may be a problem with the communication between the server and client platforms, which can prevent new devices from being successfully managed.
**Step 3: Check for failed SNMP communication between the server and managed device**
The context specifically mentions that failed SNMP communication between the server and managed device can cause issues with managing new devices. This suggests that there may be a problem with the communication between the server and the managed device, which can prevent new devices from being successfully managed.
**Step 4: Check for failed deployment of the configuration request**
The context also mentions that failed deployment of the configuration request can cause issues with managing new devices. This implies that there may be a problem with the deployment process, which can prevent new devices from being successfully managed.
**Step 5: Perform the following steps**
The context instructs the user to perform the following steps:
1. Choose Administration→NE Maintenance→Deployment from the XXX main menu.
2. The Deployment form opens, listing incomplete deployments, deployer, tag, state, and other information.
Based on the context, it appears that the user needs to review the deployment history to identify any issues that may be preventing the deployment of new devices.
**Answer**
Based on the analysis, the user needs to:
1. Check that the server and client platforms are appropriately sized.
2. Check for resynchronization problems between the managed network and the NFM-P.
3. Check for failed SNMP communication between the server and managed device.
4. Check for failed deployment of the configuration request.
By following these steps, the user should be able to identify and resolve the issues preventing the management of
Conclusion
We can efficiently summarize and extract keywords from large documents using a small LLM like Llama 3.2 1B Instruct. These summaries and keywords can be embedded and stored in a database such as LanceDB, enabling efficient retrieval for RAG systems that use the LLM throughout the workflow, not just at the generation step.
References
- Meta Llama 3.2 1B Instruct model
- Sentence Transformers
- LanceDB
- PyMuPDF Documentation