People use large language models to perform various tasks with text data from different sources. Such tasks may include (but are not limited to) editing, summarizing, translating, or extracting text. One of the main challenges of this workflow is ensuring that your data is AI-ready. This article briefly describes what it means to be AI-ready and provides some no-code solutions to get there.
We are surrounded by vast collections of unstructured text data from different sources, including web pages, PDF files, emails, organizational documents, etc. In the age of AI, these unstructured text documents can be essential sources of information. For many people, the typical workflow for unstructured text data involves pasting a block of text into a message to a large language model (LLM).
While the copy and paste method is a standard strategy for working with LLMs, you will likely encounter situations where it does not work. Consider the following:
- While many premium models allow you to upload and process documents, the file size is restricted. If the file is too large, you will need other strategies to get the relevant text into the model.
- You may want to process only a small section of text from a larger document. Providing the entire document to the LLM may interfere with the task because of the irrelevant text.
- Some text documents and web pages, especially PDFs, contain a lot of formatting that can interfere with the way the text is processed. You may not be able to use the copy and paste method due to the format of the document: tables and columns can be problematic.
Being AI-ready means your data is in a format that an LLM can easily read and process. For text data, this means plain text with a structure that the LLM can easily interpret. The Markdown file type is ideal for ensuring your data is AI-ready.
Plain text is the most basic file type on your computer. It is usually denoted by a .txt extension. Many different _editors_ can be used to create and edit plain text files in the same way that Microsoft Word is used to create and edit stylized documents. For example, the Notepad app on a PC and the TextEdit app on a Mac are default text editors. However, unlike Microsoft Word documents, plain text files do not allow text to be stylized (e.g., bold, underline, italic, etc.). They contain only raw characters.
Markdown files are plain text files with the extension .md. What makes the Markdown file unique is the use of certain characters to indicate formatting. These special characters are interpreted by Markdown-compatible applications to represent text with specific styles and structures. For example, text surrounded by single asterisks will appear in italics, while double asterisks will display the text in bold. Markdown also provides easy ways to create headings, lists, links, and other standard document elements, all while keeping the file as plain text.
The relationship between Markdown and large language models (LLMs) is simple. Markdown files contain plain text content that can be quickly processed and understood by LLMs. LLMs can recognize and interpret Markdown formatting as meaningful information, improving text comprehension. Markdown uses hash symbols (#) for headings, which create a hierarchical structure. A single hash denotes a level 1 heading, two hashes a level 2 heading, three hashes a level 3 heading, and so on. These headings serve as contextual cues for LLMs when processing information. Models can use this structure to better understand the organization and importance of different sections of the text.
By recognizing the elements of Markdown, LLMs can grasp the content and its intended structure and emphasis. This leads to more accurate interpretation and text generation. It allows LLMs to extract additional meaning from the text structure beyond the words themselves, improving their ability to understand and work with Markdown-formatted documents. Additionally, LLMs typically display their results in Markdown format, so sending and receiving Markdown content gives you a much more streamlined workflow. You will also find that many other applications allow Markdown formatting (e.g., Slack, Discord, GitHub, Google Docs).
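Nothing in this article requires code, but for readers curious about how a program can pick up the heading hierarchy, here is a minimal Python sketch. The names `HEADING` and `outline` are my own, and the pattern is a simplification: it treats any line starting with one to six hash symbols followed by a space as a heading, and it does not skip lines inside code blocks.

```python
import re

# One to six '#' characters, at least one space, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")

def outline(markdown_text):
    """Return a (level, title) pair for every heading line in the text."""
    headings = []
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            # The number of '#' characters is the heading level.
            headings.append((len(match.group(1)), match.group(2)))
    return headings

sample = "# Report\n\nIntro text.\n\n## Methods\n\n### Sampling\n\n## Results\n"
for level, title in outline(sample):
    print("  " * (level - 1) + title)
```

The indented printout mirrors the document's structure, which is the same organizational cue an LLM can draw on.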
There are many resources on the Internet for learning Markdown. Take some time to learn the format.
This section explores essential tools for managing Markdown and integrating it with large language models (LLMs). The workflow involves several key steps:
- Source material: We start with formatted text sources, such as PDF files, web pages, or Word documents.
- Conversion: Using specialized tools, we convert these formatted texts to plain text, specifically to the Markdown format.
- Storage (optional): Converted Markdown text can be stored as is. This step is recommended if you plan to reuse or reference the text later.
- LLM Processing: Markdown text is then entered into an LLM.
- Generation of results: The LLM processes the data and generates output text.
- Result Storage: The LLM result can be stored for later use or analysis.
This workflow efficiently converts various types of documents into a format that LLMs can process quickly while maintaining the option to store both input and output for future reference.
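The tools below cover these steps without any code. For readers who prefer to script things, the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `convert` and `ask_llm` are hypothetical callables that you would supply yourself (a converter that returns Markdown and a function that sends a prompt to your model of choice), and the file names `input.md` and `output.md` are arbitrary.

```python
from pathlib import Path

def run_workflow(source, convert, ask_llm, instruction, out_dir):
    """Convert a source to Markdown, send it to an LLM, and store both texts."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Conversion: `convert` is a hypothetical callable (source -> Markdown string).
    markdown_text = convert(source)

    # Storage (optional): keep the converted input for later reuse.
    (out / "input.md").write_text(markdown_text, encoding="utf-8")

    # LLM processing and generation of results:
    # `ask_llm` is a hypothetical callable (prompt string -> reply string).
    prompt = f"{instruction}\n\n{markdown_text}"
    result = ask_llm(prompt)

    # Result storage: keep the model's output next to the input.
    (out / "output.md").write_text(result, encoding="utf-8")
    return result
```

Passing the converter and the model call in as arguments keeps the sketch independent of any particular tool or provider.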
Obsidian: Save and store plain text
Obsidian is one of the best options available for saving and storing plain text and markdown files. When I extract plain text content from PDF files and web pages, I typically save that content in Obsidian, a free text editor ideal for this purpose. I also use Obsidian for my other work, including taking notes and saving prompts. This is a fantastic tool worth learning.
Obsidian is simply a tool for saving and storing plain text content. You'll probably want this part of your workflow, but it's NOT required!
Jina AI – Reader: Extract plain text from websites
Jina AI is one of my favorite AI companies. It creates a set of tools for working with LLMs. Jina AI Reader is an extraordinary tool that converts a web page into Markdown format, allowing you to capture content in plain text to be processed by an LLM. The process is very simple. Add `https://r.jina.ai/` to the beginning of any URL and you will receive AI-ready content for your LLM.
For example, consider the Wikipedia page on large language models: en.wikipedia.org/wiki/Large_language_model
Suppose we only want to use the text about LLMs contained on this page. Extracting that information can be done using the copy and paste method, but it will be cumbersome because of the rest of the formatting. However, we can use Jina AI Reader by adding `https://r.jina.ai/` at the beginning of the URL, which gives `https://r.jina.ai/https://en.wikipedia.org/wiki/Large_language_model`.
This returns the entire page in Markdown format.
From here, we can easily copy and paste the relevant content into the LLM. Alternatively, we can save the Markdown content to Obsidian, allowing it to be reused over time. While Jina AI offers premium services at a very low cost, you can use this tool for free.
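You can do all of this in the browser, but the same prefixing step is easy to script. The following Python sketch uses only the standard library. The function names `reader_url` and `fetch_markdown` are my own, and I am assuming that a plain request without an API key is accepted, which may be rate limited in practice.

```python
from urllib.request import urlopen

READER_PREFIX = "https://r.jina.ai/"

def reader_url(page_url):
    """Prepend the Jina AI Reader prefix to a page URL."""
    return READER_PREFIX + page_url

def fetch_markdown(page_url, timeout=30):
    """Download the Markdown the Reader returns for a page (needs network access)."""
    with urlopen(reader_url(page_url), timeout=timeout) as response:
        return response.read().decode("utf-8")

print(reader_url("https://en.wikipedia.org/wiki/Large_language_model"))
```

Calling `fetch_markdown` with the same URL should return the page as Markdown text, which you can then save or paste into your LLM.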
LlamaParse: Extracting Plain Text from Documents
Heavily formatted PDF files and other stylized documents present another common challenge. When working with large language models (LLMs), we often need to eliminate formatting to focus on content. Consider a scenario where you want to use only specific sections of a PDF report. The complex styling of the document makes copy and paste impractical. Additionally, if you upload the entire document to an LLM, you may have difficulty identifying and processing only the desired sections. This situation requires a tool that can separate content from format. LlamaParse by LlamaIndex addresses this need by effectively decoupling text from its stylistic elements.
To access LlamaParse, you can log in to LlamaCloud: https://cloud.llamaindex.ai/login. After signing in to LlamaCloud, go to LlamaParse on the left side of the screen.
Once you have accessed the parsing feature, you can extract the content by following these steps. First, change the mode to “Precise”, which creates results in Markdown format. Second, drag and drop your document. LlamaParse can process many different file types, but in my experience you will typically need it for PDF, Word, and PowerPoint files. In this example, I use a publicly available report by the American Board of Social Work. This is a very stylized 94-page report.
Now, you can copy and paste the Markdown content or you can export the entire file as Markdown.
On the free plan, you can parse 1,000 pages per day. LlamaParse has many other features worth exploring.
Preparing text data for AI analysis involves several strategies. While using these techniques may seem challenging initially, practice will help you become more familiar with the tools and workflows. Over time, you will learn to apply them efficiently to your specific tasks.
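Once a long report has been exported as Markdown, its headings make it possible to pull out only the section you need, which is the scenario described at the start of this section. Here is a hedged Python sketch of that idea. The function name `extract_section` and the sample text are hypothetical, and the sketch assumes the export marks its sections with hash-style headings.

```python
import re

# One to six '#' characters, at least one space, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")

def extract_section(markdown_text, title):
    """Return the heading matching `title` and everything beneath it,
    stopping at the next heading of the same or a higher level."""
    kept = []
    section_level = None
    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if section_level is None:
            # Still looking for the requested heading (case-insensitive).
            if match and match.group(2).lower() == title.lower():
                section_level = len(match.group(1))
                kept.append(line)
        else:
            # Inside the section: stop at a heading that is not a subsection.
            if match and len(match.group(1)) <= section_level:
                break
            kept.append(line)
    return "\n".join(kept).strip()

report = (
    "# Annual Report\n\nOverview.\n\n"
    "## Findings\n\nKey result.\n\n### Details\n\nMore.\n\n"
    "## Appendix\n\nTables.\n"
)
print(extract_section(report, "Findings"))
```

Subsections stay with their parent because they have a deeper heading level, so the extracted text keeps its internal structure when you pass it to the LLM.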