Tired of manually combing through hours of audio to find the key ideas? This guide teaches you to build an AI chatbot that turns recordings (meetings, podcasts, interviews) into interactive conversations. Using AssemblyAI for precise transcription with speaker labels, Qdrant for fast vector storage, and DeepSeek-R1 via SambaNova Cloud for smart answers, we will create a RAG tool that answers questions such as "What did [speaker] say?" or "Summarize this segment." In short, we convert your audio into an AI-searchable dialogue by building a RAG system with AssemblyAI, Qdrant, and DeepSeek-R1.
Learning objectives
- Leverage the AssemblyAI API to transcribe audio files with speaker labels, converting conversations into structured text data for analysis.
- Set up the Qdrant vector database to efficiently store and retrieve embeddings of transcribed audio content generated with Hugging Face models.
- Implement RAG with the DeepSeek-R1 model via SambaNova Cloud to generate context-aware answers.
- Create a Streamlit web interface for users to upload audio files, view transcripts, and interact with the chatbot in real time.
- Build an end-to-end workflow that combines audio processing, vector storage, and AI-driven response generation to create a scalable audio-based chat application.
This article was published as part of the Data Science Blogathon.
What is AssemblyAI?
AssemblyAI is your go-to tool for turning audio into actionable insights. Whether you are transcribing podcasts, analyzing customer calls, or captioning videos, its speech-to-text engine delivers top-tier accuracy, even with accents or background noise.

What is SambaNova Cloud?
Imagine running massive open-source models such as DeepSeek-R1 (671B) up to 10 times faster, and without the usual infrastructure headaches.

Instead of relying on GPUs, SambaNova uses RDUs (Reconfigurable Dataflow Units), which unlock faster performance with:
- Massive in-memory storage: no constant model reloading
- Efficient dataflow design: optimized for high-throughput tasks
- Instant model switching: switch between models in microseconds
- Run DeepSeek-R1 instantly, no complicated setup required
- Train and fine-tune on the same platform, all in one place
What is Qdrant?
Qdrant is a lightning-fast vector database built to power AI applications; think of it as a search engine that finds needles in haystacks. Whether you are creating a recommendation system, an image search tool, or a chatbot, Qdrant specializes in similarity search, quickly identifying the closest matches for complex data such as text embeddings or visual features.

What is DeepSeek-R1?
DeepSeek-R1 is a game-changing language model that combines human-like adaptability with cutting-edge AI, making it a standout in natural language processing. Whether you are creating content, translating languages, debugging code, or summarizing complex reports, R1 excels at understanding context, tone, and intent, delivering answers that feel intuitive rather than robotic. By prioritizing empathy and precision, DeepSeek-R1 is not just a tool; it is a glimpse of a future where AI communicates as naturally as we do.

Building the RAG model with AssemblyAI and DeepSeek-R1
Now that we understand all the components, let's dive into building our RAG pipeline. But before we do, let's quickly cover what you will need to get started.
1. Prerequisites
Below are the required prerequisites:
Clone the repository:
git clone https://github.com/karthikponna/chat_with_audios.git
cd chat_with_audios
Create and activate the virtual environment:
# For macOS and Linux:
python3 -m venv venv
source venv/bin/activate
# For Windows:
python -m venv venv
.\venv\Scripts\activate
Install required dependencies:
pip install -r requirements.txt
Configure environment variables:
Create a `.env` file and add your AssemblyAI and SambaNova (https://cloud.sambanova.ai/apis) API keys:
ASSEMBLYAI_API_KEY="your_assemblyai_api_key_string"
SAMBANOVA_API_KEY="your_sambanova_api_key_string"
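As a quick, optional sanity check (not part of the project code), you can confirm that the keys are picked up correctly using python-dotenv, which the project already relies on:

# quick sanity check that the .env file is loaded correctly
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current directory

assert os.getenv("ASSEMBLYAI_API_KEY"), "ASSEMBLYAI_API_KEY is missing"
assert os.getenv("SAMBANOVA_API_KEY"), "SAMBANOVA_API_KEY is missing"
print("API keys loaded.")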
Now let's start with the coding part.
2. Retrieval Augmented Generation (RAG)
RAG fuses large language models with external data to produce more accurate, context-aware answers. It fetches relevant information at query time, ensuring that answers are grounded in real data rather than relying on model training alone.
2.1 Import necessary libraries
We create a file called rag_code.py. We will walk through the code step by step, starting by importing the necessary modules and orchestrating the code architecture using LlamaIndex (https://www.llamaindex.ai/).
from qdrant_client import models
from qdrant_client import QdrantClient
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.sambanovasystems import SambaNovaCloud
from llama_index.llms.ollama import Ollama
import assemblyai as aai
from typing import List, Dict
from llama_index.core.base.llms.types import (
ChatMessage,
MessageRole,
)
2.2 Batch processing and embedding with Hugging Face
Here the batch_iterate function splits a list of texts into smaller chunks, making it easier to process large datasets. The EmbedData class then loads a Hugging Face embedding model, generates embeddings for each batch of text, and collects these embeddings for later use.
def batch_iterate(lst, batch_size):
    """Yield successive n-sized chunks from lst."""
    for i in range(0, len(lst), batch_size):
        yield lst[i : i + batch_size]

class EmbedData:
    def __init__(self, embed_model_name="BAAI/bge-large-en-v1.5", batch_size=32):
        self.embed_model_name = embed_model_name
        self.embed_model = self._load_embed_model()
        self.batch_size = batch_size
        self.embeddings = []

    def _load_embed_model(self):
        embed_model = HuggingFaceEmbedding(model_name=self.embed_model_name, trust_remote_code=True, cache_folder="./hf_cache")
        return embed_model

    def generate_embedding(self, context):
        return self.embed_model.get_text_embedding_batch(context)

    def embed(self, contexts):
        self.contexts = contexts
        for batch_context in batch_iterate(contexts, self.batch_size):
            batch_embeddings = self.generate_embedding(batch_context)
            self.embeddings.extend(batch_embeddings)
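As a quick illustration (not part of the project files), here is how EmbedData might be used on a couple of short transcript snippets; the texts are made up, and the model name and batch size are simply the defaults from the class above.

# hypothetical usage of EmbedData on two transcript snippets
embeddata = EmbedData(embed_model_name="BAAI/bge-large-en-v1.5", batch_size=32)
embeddata.embed([
    "Speaker A: Welcome to the show.",
    "Speaker B: Thanks for having me.",
])
print(len(embeddata.embeddings))     # one embedding per context
print(len(embeddata.embeddings[0]))  # 1024 dimensions for bge-large-en-v1.5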
2.3 Qdrant vector database setup and ingestion
- The QdrantVDB_QB class initializes a Qdrant vector database by configuring key parameters such as the collection name, vector dimension, and batch size, and connects to Qdrant while checking for an existing collection (creating one if necessary).
- Its ingest_data method efficiently uploads the text contexts together with their corresponding embeddings in batches and then updates the collection configuration accordingly.
class QdrantVDB_QB:
    def __init__(self, collection_name, vector_dim=768, batch_size=512):
        self.collection_name = collection_name
        self.batch_size = batch_size
        self.vector_dim = vector_dim

    def define_client(self):
        self.client = QdrantClient(url="http://localhost:6333", prefer_grpc=True)

    def create_collection(self):
        if not self.client.collection_exists(collection_name=self.collection_name):
            self.client.create_collection(collection_name=f"{self.collection_name}",
                                          vectors_config=models.VectorParams(size=self.vector_dim,
                                                                             distance=models.Distance.DOT,
                                                                             on_disk=True),
                                          optimizers_config=models.OptimizersConfigDiff(default_segment_number=5,
                                                                                        indexing_threshold=0),
                                          quantization_config=models.BinaryQuantization(
                                              binary=models.BinaryQuantizationConfig(always_ram=True)),
                                          )

    def ingest_data(self, embeddata):
        for batch_context, batch_embeddings in zip(batch_iterate(embeddata.contexts, self.batch_size),
                                                   batch_iterate(embeddata.embeddings, self.batch_size)):
            self.client.upload_collection(collection_name=self.collection_name,
                                          vectors=batch_embeddings,
                                          payload=[{"context": context} for context in batch_context])

        self.client.update_collection(collection_name=self.collection_name,
                                      optimizer_config=models.OptimizersConfigDiff(indexing_threshold=20000)
                                      )
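Note that define_client assumes a Qdrant server is already running locally at http://localhost:6333 (for example, started via Docker). Below is a minimal usage sketch that reuses the embeddata object from the previous step; the collection name matches the one used in the Streamlit app later.

# sketch: requires a local Qdrant server listening on http://localhost:6333
vector_db = QdrantVDB_QB(collection_name="chat with audios", vector_dim=1024, batch_size=512)
vector_db.define_client()                    # connect to the local Qdrant instance
vector_db.create_collection()                # create the collection if it does not exist yet
vector_db.ingest_data(embeddata=embeddata)   # upload contexts and embeddings in batches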
2.4 Query embedding retriever
- The Retriever class is designed to bridge the gap between user queries and the vector database, initializing with a vector database client and an embedding model.
- Its search method transforms a query into an embedding using that model, then performs a vector search against the database with tuned quantization parameters to quickly retrieve relevant results.
class Retriever:
    def __init__(self, vector_db, embeddata):
        self.vector_db = vector_db
        self.embeddata = embeddata

    def search(self, query):
        query_embedding = self.embeddata.embed_model.get_query_embedding(query)

        result = self.vector_db.client.search(
            collection_name=self.vector_db.collection_name,
            query_vector=query_embedding,
            search_params=models.SearchParams(
                quantization=models.QuantizationSearchParams(
                    ignore=False,
                    rescore=True,
                    oversampling=2.0,
                )
            ),
            timeout=1000,
        )
        return result
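To get a feel for what the retriever returns, here is a small illustrative sketch that reuses the vector_db and embeddata objects from above; each hit is a Qdrant scored point carrying the original text in its payload.

# sketch: run a query against the ingested collection
retriever = Retriever(vector_db=vector_db, embeddata=embeddata)
hits = retriever.search("What did the speakers discuss?")
for hit in hits:
    print(round(hit.score, 3), hit.payload["context"][:80])  # similarity score and a text preview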
2.5 Smart RAG query assistant
The RAG class integrates a retriever and an LLM to generate context-aware answers. It retrieves relevant information from the vector database, formats it into a structured prompt, and sends it to the LLM for a response. I am using SambaNovaCloud to access the LLM through its API for efficient text generation.
class RAG:
    def __init__(self,
                 retriever,
                 llm_name="Meta-Llama-3.1-405B-Instruct"
                 ):
        system_msg = ChatMessage(
            role=MessageRole.SYSTEM,
            content="You are a helpful assistant that answers questions about the user's document.",
        )
        self.messages = [system_msg, ]
        self.llm_name = llm_name
        self.llm = self._setup_llm()
        self.retriever = retriever
        self.qa_prompt_tmpl_str = ("Context information is below.\n"
                                   "---------------------\n"
                                   "{context}\n"
                                   "---------------------\n"
                                   "Given the context information above I want you to think step by step to answer the query in a crisp manner, in case you don't know the answer say 'I don't know!'.\n"
                                   "Query: {query}\n"
                                   "Answer: "
                                   )

    def _setup_llm(self):
        return SambaNovaCloud(
            model=self.llm_name,
            temperature=0.7,
            context_window=100000,
        )
        # return Ollama(model=self.llm_name,
        #               temperature=0.7,
        #               context_window=100000,
        #               )

    def generate_context(self, query):
        result = self.retriever.search(query)
        context = [dict(data) for data in result]
        combined_prompt = []

        for entry in context[:2]:
            context = entry["payload"]["context"]
            combined_prompt.append(context)

        return "\n\n---\n\n".join(combined_prompt)

    def query(self, query):
        context = self.generate_context(query=query)
        prompt = self.qa_prompt_tmpl_str.format(context=context, query=query)
        user_msg = ChatMessage(role=MessageRole.USER, content=prompt)
        # self.messages.append(ChatMessage(role=MessageRole.USER, content=prompt))
        streaming_response = self.llm.stream_complete(user_msg.content)
        return streaming_response
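Here is a brief sketch of how the class can be consumed outside Streamlit (the app below does essentially the same thing). Since the answer is streamed, we iterate over chunks; this sketch reads the standard llama-index chunk.delta field for the incremental text, whereas the app instead reads the raw provider payload.

# sketch: query the RAG engine and stream the answer to stdout
rag = RAG(retriever=retriever, llm_name="DeepSeek-R1-Distill-Llama-70B")
for chunk in rag.query("Summarize the conversation in two sentences."):
    # chunk.delta holds the newly generated text for this step (llama-index CompletionResponse)
    print(chunk.delta or "", end="", flush=True)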
2.6 Audio transcription
Here the Transcribe class is initialized by setting the AssemblyAI API key and creating a transcriber. It then processes an audio file using a configuration that enables speaker labels, ultimately returning a list of dictionaries where each entry maps a speaker to its transcribed text.
class Transcribe:
    def __init__(self, api_key: str):
        """Initialize the Transcribe class with AssemblyAI API key."""
        aai.settings.api_key = api_key
        self.transcriber = aai.Transcriber()

    def transcribe_audio(self, audio_path: str) -> List[Dict[str, str]]:
        """
        Transcribe an audio file and return speaker-labeled transcripts.

        Args:
            audio_path: Path to the audio file

        Returns:
            List of dictionaries containing speaker and text information
        """
        # Configure transcription with speaker labels
        config = aai.TranscriptionConfig(
            speaker_labels=True,
            speakers_expected=2  # Adjust this based on your needs
        )

        # Transcribe the audio
        transcript = self.transcriber.transcribe(audio_path, config=config)

        # Extract speaker utterances
        speaker_transcripts = []
        for utterance in transcript.utterances:
            speaker_transcripts.append({
                "speaker": f"Speaker {utterance.speaker}",
                "text": utterance.text
            })

        return speaker_transcripts
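To try the class on its own (outside the Streamlit app), a minimal sketch like the following should work; the audio path is just a placeholder, and the API key is read from the `.env` file created earlier.

# sketch: transcribe a local audio file and print the speaker-labeled segments
import os
from dotenv import load_dotenv

load_dotenv()  # loads ASSEMBLYAI_API_KEY from the .env file

transcriber = Transcribe(api_key=os.getenv("ASSEMBLYAI_API_KEY"))
segments = transcriber.transcribe_audio("path/to/your_audio.mp3")  # hypothetical file path
for seg in segments:
    print(f"{seg['speaker']}: {seg['text']}")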
3. Streamlit application
Streamlit is a Python library that turns data scripts into interactive web apps, making it perfect for LLM-based solutions.
- The following code creates a user-friendly app that lets users upload an audio file, view its transcript, and chat with its contents.
- AssemblyAI transcribes the uploaded audio into speaker-labeled text.
- The transcript is embedded and stored in a Qdrant vector database for efficient retrieval.
- A retriever paired with a RAG engine generates context-aware chat responses using these embeddings.
- Session state manages chat history and file caching to ensure a smooth experience.
import os
import gc
import uuid
import tempfile
import base64
from dotenv import load_dotenv
from rag_code import Transcribe, EmbedData, QdrantVDB_QB, Retriever, RAG
import streamlit as st
if "id" not in st.session_state:
st.session_state.id = uuid.uuid4()
st.session_state.file_cache = {}
session_id = st.session_state.id
collection_name = "chat with audios"
batch_size = 32
load_dotenv()
def reset_chat():
st.session_state.messages = ()
st.session_state.context = None
gc.collect()
with st.sidebar:
    st.header("Add your audio file!")
    uploaded_file = st.file_uploader("Choose your audio file", type=["mp3", "wav", "m4a"])

    if uploaded_file:
        try:
            with tempfile.TemporaryDirectory() as temp_dir:
                file_path = os.path.join(temp_dir, uploaded_file.name)

                with open(file_path, "wb") as f:
                    f.write(uploaded_file.getvalue())

                file_key = f"{session_id}-{uploaded_file.name}"
                st.write("Transcribing with AssemblyAI and storing in vector database...")

                if file_key not in st.session_state.get('file_cache', {}):
                    # Initialize transcriber
                    transcriber = Transcribe(api_key=os.getenv("ASSEMBLYAI_API_KEY"))

                    # Get speaker-labeled transcripts
                    transcripts = transcriber.transcribe_audio(file_path)
                    st.session_state.transcripts = transcripts

                    # Each speaker segment becomes a separate document for embedding
                    documents = [f"Speaker {t['speaker']}: {t['text']}" for t in transcripts]

                    # embed data
                    embeddata = EmbedData(embed_model_name="BAAI/bge-large-en-v1.5", batch_size=batch_size)
                    embeddata.embed(documents)

                    # set up vector database
                    qdrant_vdb = QdrantVDB_QB(collection_name=collection_name,
                                              batch_size=batch_size,
                                              vector_dim=1024)
                    qdrant_vdb.define_client()
                    qdrant_vdb.create_collection()
                    qdrant_vdb.ingest_data(embeddata=embeddata)

                    # set up retriever
                    retriever = Retriever(vector_db=qdrant_vdb, embeddata=embeddata)

                    # set up rag
                    query_engine = RAG(retriever=retriever, llm_name="DeepSeek-R1-Distill-Llama-70B")

                    st.session_state.file_cache[file_key] = query_engine
                else:
                    query_engine = st.session_state.file_cache[file_key]

                # Inform the user that the file is processed
                st.success("Ready to Chat!")

                # Display audio player
                st.audio(uploaded_file)

                # Display speaker-labeled transcript
                st.subheader("Transcript")
                with st.expander("Show full transcript", expanded=True):
                    for t in st.session_state.transcripts:
                        st.text(f"**{t['speaker']}**: {t['text']}")

        except Exception as e:
            st.error(f"An error occurred: {e}")
            st.stop()
col1, col2 = st.columns([6, 1])

with col1:
    # Page header; the AssemblyAI and DeepSeek logos are embedded inline as base64 data URIs
    st.markdown("""
    # RAG over Audio powered by <img src="data:image/png;base64,{}"> and <img src="data:image/png;base64,{}">
    """.format(base64.b64encode(open("assets/AssemblyAI.png", "rb").read()).decode(),
               base64.b64encode(open("assets/deep-seek.png", "rb").read()).decode()),
               unsafe_allow_html=True)

with col2:
    st.button("Clear ↺", on_click=reset_chat)
# Initialize chat history
if "messages" not in st.session_state:
    reset_chat()
# Display chat messages from history on app rerun
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Accept user input
if prompt := st.chat_input("Ask about the audio conversation..."):
    # Add user message to chat history
    st.session_state.messages.append({"role": "user", "content": prompt})
    # Display user message in chat message container
    with st.chat_message("user"):
        st.markdown(prompt)

    # Display assistant response in chat message container
    with st.chat_message("assistant"):
        message_placeholder = st.empty()
        full_response = ""

        # Get streaming response
        streaming_response = query_engine.query(prompt)

        for chunk in streaming_response:
            try:
                new_text = chunk.raw["choices"][0]["delta"]["content"]
                full_response += new_text
                message_placeholder.markdown(full_response + "▌")
            except:
                pass

        message_placeholder.markdown(full_response)

    # Add assistant response to chat history
    st.session_state.messages.append({"role": "assistant", "content": full_response})
Run the app.py file from the terminal with the following command; you can then upload an audio file and interact with the chatbot.
streamlit run app.py
You can see a demo of the application here, and you can download the sample audio file here.
Conclusion
We have successfully combined AssemblyAI, SambaNova Cloud, Qdrant, and DeepSeek-R1 to build a chatbot that uses Retrieval Augmented Generation over audio. The rag_code.py file manages the RAG workflow, while the app.py file provides a simple Streamlit interface. I encourage you to interact with this chatbot using different audio files, tweak the code, add new features, and explore the endless possibilities of audio-based chat solutions.
GITHUB repo: https://github.com/karthikponna/chat_with_audios/tree/main
Key takeaways
- Leveraging AssemblyAI for audio transcription produces accurate, speaker-labeled text, providing a solid foundation for advanced conversational experiences.
- Qdrant integration guarantees rapid vector-based retrieval, offering quick access to the relevant context for more informed responses.
- Applying a RAG approach combines retrieval and generation, ensuring answers are grounded in real data.
- Using SambaNova Cloud for the LLM offers robust language understanding, powering engaging, context-aware interactions.
- Using Streamlit for the user interface provides a straightforward, interactive environment that simplifies deploying an audio-based chatbot.
The media shown in this article is not owned by Analytics Vidhya and is used at the author's discretion.
Frequently asked questions
Q. What is RAG and why is it used here?
A. RAG stands for Retrieval Augmented Generation. It fetches relevant data from a vector database, ensuring that chatbot responses are grounded in real context rather than relying on model predictions alone.
Q. How do I use a different embedding model?
A. Simply change the embed_model_name in the EmbedData class to your preferred Hugging Face model, ensuring it supports text embeddings.
Q. How can I customize the chatbot's prompt?
A. Adjust the qa_prompt_tmpl_str in the RAG class to include any additional instructions or formatting your application needs.
Q. Why use Qdrant for this project?
A. Qdrant provides efficient vector search, making it easy to find the relevant context within large sets of embedded text.