Currently, companies follow two approaches for LLM-driven applications: fine-tuning and Retrieval Augmented Generation (RAG). At a very high level, RAG takes an input and retrieves a set of relevant/supporting documents from a source (e.g. a company wiki). The documents are concatenated as context with the original input and sent to the LLM, which produces the final response. RAG seems to be the most popular approach to bringing LLMs to market, especially in real-time processing scenarios. Supporting that architecture almost always means building an effective data pipeline.
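As a rough sketch of that flow (the `vector_store` and `llm` clients below are hypothetical placeholders, not any particular vendor's API):

```python
# Minimal RAG flow: retrieve supporting documents, concatenate them as
# context, and ask the LLM to answer grounded in that context.
def answer_with_rag(question: str, vector_store, llm, top_k: int = 4) -> str:
    # 1. Retrieve the most relevant documents for the user's question.
    documents = vector_store.search(question, top_k=top_k)

    # 2. Concatenate the retrieved text as context for the prompt.
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. Send the combined prompt to the LLM and return its response.
    return llm.complete(prompt)
```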
In this post, we’ll explore different stages in the LLM data pipeline to help developers implement production-grade systems that work with their data. Follow the instructions to learn how to ingest, prepare, enrich, and serve data to power GenAI applications.
These are the different stages of an LLM process:
Data ingestion of unstructured data
Vectorization with enrichment (with metadata)
Vector indexing (with real-time synchronization)
AI query processor
User interaction in natural language (with chat or API)
Data ingestion of unstructured data
The first step is to collect the right data to meet your business objectives. If you are creating a consumer-facing chatbot, you need to pay special attention to the data that will be used. Data sources can range from a company portal (e.g. SharePoint, Confluence, document storage) to internal APIs. Ideally, you want a mechanism that pushes changes from these sources into the index so that your LLM application stays up to date for your end consumer.
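A minimal sketch of such a mechanism is below: a recurring job that polls each source and only re-submits documents that are new or changed. The connector and pipeline objects are hypothetical stand-ins for whatever clients your sources actually expose.

```python
# Recurring ingestion job: pull documents from several sources and hand
# new or updated ones to the downstream pipeline so the index stays current.
import time

def run_ingestion(connectors, pipeline, poll_interval_seconds: int = 300):
    seen = {}  # document id -> last modified timestamp
    while True:
        for connector in connectors:
            for doc in connector.list_documents():
                # Only re-process documents that are new or have changed.
                if seen.get(doc.id) != doc.last_modified:
                    pipeline.submit(doc)
                    seen[doc.id] = doc.last_modified
        time.sleep(poll_interval_seconds)
```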
Organizations should implement data governance policies and protocols when extracting text data for in-context LLM training. Organizations can start by auditing document data sources to catalog sensitivity levels, license terms, and origin. Identify restricted data that needs redaction or exclusion from data sets.
The quality of these data sources must also be evaluated: diversity, size, noise levels, redundancy. Lower-quality data sets will dilute the responses of LLM applications. You may even need a document classification mechanism early on to help route documents to the right type of storage later in the process.
Adhering to data governance guardrails, even in accelerated LLM development, reduces risks. Establishing upfront governance mitigates many issues down the road and enables robust, scalable extraction of text data for contextual learning.
Pulling messages through the Slack, Telegram, or Discord APIs provides access to the real-time data that makes RAG valuable, but raw conversational data contains noise: typos, encoding issues, strange characters. Filtering out real-time messages with offensive content or sensitive personal details that could constitute PII is an important part of data cleansing.
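A simple cleansing step might look like the sketch below. The regexes and blocklist are illustrative only; production systems typically rely on a dedicated PII-detection and moderation service.

```python
# Cleanse a raw chat message: normalize encoding, drop offensive messages,
# and redact likely PII such as email addresses and phone numbers.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
BLOCKLIST = {"badword1", "badword2"}  # placeholder offensive-content terms

def cleanse_message(text: str):
    # Strip broken encodings and non-printable characters.
    text = text.encode("utf-8", errors="ignore").decode("utf-8")
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")

    # Drop messages containing offensive terms entirely.
    if any(word in text.lower() for word in BLOCKLIST):
        return None

    # Redact likely PII before the message enters the pipeline.
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text.strip()
```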
Vectorization with metadata
Metadata such as the author, date, and context of the conversation further enrich the data. This incorporation of external knowledge into vectors helps in smarter and more specific retrieval.
Some of the metadata related to the documents may be found in the portal or in the metadata of the document itself; however, if the document is attached to a business object (e.g. case information, customer, employee), then you will need to obtain that information from a relational database. If there are security concerns around data access, this is where you can add security metadata, which also helps with the retrieval stage later.
A critical step here is to convert text and images into vector representations using LLM embedding models. For documents, you should first perform chunking and then encode the chunks, preferably using locally hosted zero-shot embedding models.
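The sketch below shows chunking, local embedding, and metadata enrichment in one pass. It uses the sentence-transformers library as one example of a locally hosted embedding model; the chunking strategy and metadata fields are assumptions you would adapt to your own documents.

```python
# Chunk a document, embed each chunk locally, and attach enrichment
# metadata (author, date, business object, security labels) to every vector.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50):
    # Simple fixed-size character chunking with overlap between chunks.
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def vectorize_document(doc_text: str, metadata: dict):
    chunks = chunk_text(doc_text)
    embeddings = model.encode(chunks)
    # Each record carries the vector, the raw chunk, and the metadata
    # needed for filtering and access control at retrieval time.
    return [
        {"vector": emb.tolist(), "text": chunk, "metadata": metadata}
        for chunk, emb in zip(chunks, embeddings)
    ]
```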
Vector indexing
Vector representations must be stored somewhere. This is where vector databases or vector indexes are used to efficiently store and index this information as embeddings.
This becomes your “LLM source of truth” and should stay in sync with your data sources and documents. Real-time indexing becomes important if your LLM application serves customers or generates business-related information; you want to prevent your LLM application from falling out of sync with your data sources.
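One common way to keep the index in sync is a delete-then-upsert pattern per document, sketched below. The `vector_db` client is a placeholder for whichever vector database or index you use.

```python
# Re-index a single document: remove its old vectors, then write the
# fresh chunks, so updates and deletions propagate to the index quickly.
def sync_document(vector_db, doc_id: str, records: list) -> None:
    vector_db.delete(filter={"doc_id": doc_id})
    vector_db.upsert(
        [
            {
                "id": f"{doc_id}-{i}",
                "vector": rec["vector"],
                "metadata": {**rec["metadata"], "doc_id": doc_id, "text": rec["text"]},
            }
            for i, rec in enumerate(records)
        ]
    )
```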
Fast retrieval with a query processor
When you have millions of business documents, getting the right content based on the user’s query becomes a challenge.
This is where the early stages of the process start to pay off: cleaning and enriching the data, adding metadata and, most importantly, indexing the data. This added context strengthens prompt engineering.
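A query processor can be sketched as follows: embed the user's question, search the index with a metadata filter (for example, a security filter), and assemble the retrieved chunks into prompt context. The `model` and `vector_db` objects are the same placeholders used in the earlier sketches, and the filter syntax is an assumption.

```python
# Embed the query, retrieve the top matching chunks the user is allowed
# to see, and return the assembled context for prompt construction.
def process_query(query: str, user_groups: list, model, vector_db, top_k: int = 5):
    query_vector = model.encode([query])[0].tolist()
    hits = vector_db.search(
        vector=query_vector,
        top_k=top_k,
        filter={"allowed_groups": {"$in": user_groups}},
    )
    context = "\n\n".join(hit["metadata"]["text"] for hit in hits)
    return context, hits
```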
User interaction
In a traditional pipeline environment, you send data to a data warehouse and the analytics tool will pull reports from the warehouse. In an LLM process, an end-user interface is typically a chat interface that, at the simplest level, takes a query from the user and responds to the query.
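At its simplest, that interface is a loop like the one below, reusing the hypothetical `process_query` sketch from the previous section; `llm.complete` again stands in for your model provider's API.

```python
# Minimal chat loop: take a user query, run retrieval, and have the LLM
# answer using only the retrieved context.
def chat(model, vector_db, llm, user_groups):
    while True:
        query = input("You: ").strip()
        if query.lower() in {"exit", "quit"}:
            break
        context, _ = process_query(query, user_groups, model, vector_db)
        prompt = (
            "Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}"
        )
        print("Assistant:", llm.complete(prompt))
```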
The challenge with this new type of pipeline is not only building a prototype, but also putting it into production. This is where an enterprise-grade monitoring solution becomes important for tracking your vector pipelines and vector stores. The ability to obtain business data from structured and unstructured data sources becomes an important architectural decision. LLMs represent the state of the art in natural language processing, and creating enterprise-grade data pipelines for LLM-powered applications keeps your organization ahead of the curve.
Here you can access an open-source real-time stream processing framework.
Anup Surendran is vice president of products and product marketing and specializes in bringing AI products to market. He has worked with startups that have had two successful exits (to SAP and Kroll) and enjoys teaching others how AI products can improve productivity within an organization.