The code to implement this entire workflow is available on GitHub.
Let's go through these steps one by one.
1. Text extraction
The documents used in this example are the AI advisory feedback forms we provide to companies after an advisory session. These companies include startups and established companies that want to integrate AI into their business or improve their existing AI solutions. The feedback document is semi-structured; its format is shown below. Names and other information in this document have been changed due to privacy restrictions.
AI experts provide their analysis for each field. However, with hundreds of such documents, extracting insights from the data becomes a difficult task. To gain insights from this data, it needs to be converted into a concise, structured format that can be analyzed using existing statistical or machine learning methods. Performing this conversion manually is not only laborious and time-consuming but also prone to errors.
In addition to the information readily visible in the document, such as the company name, date of consultation, and experts involved, my goal was to extract specific details for further analysis. These included the primary industry or domain in which each company operates, a concise description of the current solution offered, the AI field involved, company type, AI maturity level, aim, and a brief summary of the recommendations. This extraction needed to be done on the detailed text associated with each field. Furthermore, the feedback template has evolved over time, leading to inconsistently formatted documents.
Before we discuss extracting text from documents, please note that the following libraries need to be installed to run the full code used in this article.
# Install the required libraries
!pip install tqdm # For displaying a progress bar for document processing
!pip install requests # For making HTTP requests
!pip install pandas # For data manipulation and analysis
!pip install python-docx # For processing Word documents
!pip install plotly # For creating interactive visualizations
!pip install numpy # For numerical computations
!pip install scikit-learn # For machine learning algorithms and tools
!pip install matplotlib # For creating static, animated, and interactive plots
!pip install openai # For interacting with the OpenAI API
!pip install seaborn # For statistical data visualization
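The code snippets below also assume the following imports are in place. This import cell is a minimal sketch based on the libraries used throughout the article:

import os
import json

import docx          # provided by the python-docx package
import requests
import pandas as pd
from tqdm import tqdm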
The following code extracts text from a document (.docx format) using the python-docx library. It is important to extract text from all parts of the document, including paragraphs, tables, headers, and footers.
def extract_text_from_docx(docx_path: str):
    """
    Extract text content from a Word (.docx) file.
    """
    doc = docx.Document(docx_path)
    full_text = []

    # Extract text from paragraphs
    for para in doc.paragraphs:
        full_text.append(para.text)

    # Extract text from tables
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                full_text.append(cell.text)

    # Extract text from headers and footers
    for section in doc.sections:
        header = section.header
        footer = section.footer
        for para in header.paragraphs:
            full_text.append(para.text)
        for para in footer.paragraphs:
            full_text.append(para.text)

    return '\n'.join(full_text).strip()
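As a quick check, the extraction function can be tried on a single document; the file path below is only a placeholder:

# Illustrative usage with a placeholder path
sample_path = "advisory_docs/example_feedback.docx"  # hypothetical file
text = extract_text_from_docx(sample_path)
print(text[:500])  # preview the first 500 characters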
2. Set LLM prompts
We need to instruct the LLM on how to extract the required information from the documents. In addition, we need to explain the meaning of each field of interest so that the LLM can extract the semantically matching information from the documents. This is particularly important because a field name comprising one or two words can be interpreted in several ways. For example, we need to explain what we mean by "aim", which here refers to the company's plans for AI integration or how it wants to move forward with its current solution. Therefore, crafting the right prompt for this purpose is very important.
I put the extraction instructions in the system prompt to guide the LLM's behavior, while the user prompt contains the data to be processed by the LLM. The system prompt is shown below.
# System prompt with extraction instructions
system_message = """
You are an expert in analyzing and extracting information from the feedback forms written by ai experts after ai advisory sessions with companies.
Please carefully read the provided feedback form and extract the following 15 key information. Make sure that the key names are exactly the same as
given below. Do not create any additional key names other than these 15.
Key names and their descriptions:
1. Company name: name of the company seeking ai advisory
2. Country: Company's country (output 'N/A' if not available)
3. Consultation Date (output 'N/A' if not available)
4. Experts: persons providing ai consultancy (output 'N/A' if not available)
5. Consultation type: Regular or pop-up (output 'N/A' if not available)
6. Area/domain: Field of the company’s operations. Some examples: healthcare, industrial manufacturing, business development, education, etc.
7. Current Solution: description of the current solution offered by the company. The company could be currently in ideation phase. Some examples of ‘Current Solution’ field include i) Recommendation system for cars, houses, and other items, ii) Professional guidance system, iii) ai-based matchmaking service for educational peer-to-peer support. (Be very specific and concise)
8. ai field: ai's sub-field in use or required. Some examples: image processing, large language models, computer vision, natural language processing, predictive modeling, speech recognition, etc. (This field is not explicitly available in the document. Extract it by the semantic understanding of the overall document.)
9. ai maturity level: low, moderate, high (output 'N/A' if not available).
10. Company type: ‘startup’ or ‘established company’
11. Aim: The ai tasks the company is looking for. Some examples: i) Enhance ai-driven systems for diagnosing heart diseases, ii) to automate identification of key variable combinations in customer surveys, iii) to develop ai-based system for automatic quotation generation from engineering drawings, iv) to building and managing enterprise-grade LLM applications. (Be very specific and concise)
12. Identified target market: The targeted customers. Some examples: healthcare professionals, construction firms, hospitality, educational institutions, etc.
13. Data Requirement Assessment: The type of data required for the intended ai integration? Some examples: Transcripts of therapy sessions, patient data, textual data, image data, videos, etc.
14. FAIR Services Sought: The services expected from FAIR. For instance, technical advice, proof of concept.
15. Recommendations: A brief summary of the recommendations in the form of key words or phrase list. Some examples: i) Focus on data balance, monitor for bias, prioritize transparency, ii) Explore machine learning algorithms, implement decision trees, gradient boosting. (Be very specific and concise)
Guidelines:
- Very important: do not make up anything. If the information of a required field is not available, output ‘N/A’ for it.
- Output in JSON format. The JSON should contain the above 15 keys.
"""
It is important to highlight what the LLM should focus on: for example, the number of fields to be extracted, using exactly the same field names as specified, and not making up any information that is not available. It is also important to include an explanation of each field and, where possible, some examples of the required information. It is worth mentioning that an optimal prompt is rarely produced on the first attempt.
3. Process documents
Document processing refers to sending the data to an LLM for analysis. I used OpenAI's gpt-4o-mini model for document analysis; it is a small, affordable, and capable model suited to fast, lightweight tasks. GPT-4o mini is cheaper and more capable than GPT-3.5 Turbo. However, lightweight versions of open LLMs such as Llama, Mistral, or Phi-3 could also be tested for this purpose.
The following code traverses a directory and its subdirectories to find AI advisory documents (.docx format), extracts the text from each document, and sends it to gpt-4o-mini via an API call.
def process_files(directory_path: str, api_key: str, system_message: str):
    """
    Process all .docx files in the given directory and its subdirectories,
    send their content to the LLM, and store the JSON responses.
    """
    json_outputs = []
    docx_files = []

    # Walk through the directory and its subdirectories to find .docx files
    for root, dirs, files in os.walk(directory_path):
        for file in files:
            if file.endswith(".docx"):
                docx_files.append(os.path.join(root, file))

    if not docx_files:
        print("No .docx files found in the specified directory or sub-directories.")
        return json_outputs

    # Iterate through all .docx files in the directory with a progress bar
    for file_path in tqdm(docx_files, desc="Processing files...", unit="file"):
        filename = os.path.basename(file_path)
        extracted_text = extract_text_from_docx(file_path)

        # Prepare the user message with the extracted text
        input_message = extracted_text

        # Prepare the API request payload
        headers = {
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}"
        }
        payload = {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": system_message},
                {"role": "user", "content": input_message}
            ],
            "max_tokens": 2000,
            "temperature": 0.2
        }

        # Send the request to the LLM API
        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

        # Extract the JSON response
        json_response = response.json()
        content = json_response['choices'][0]['message']['content'].strip("```json\n").strip("```")
        parsed_json = json.loads(content)

        # Normalize the parsed JSON output
        normalized_json = normalize_json_output(parsed_json)

        # Append the normalized JSON output to the list
        json_outputs.append(normalized_json)

    return json_outputs
In the request payload, I set the maximum number of tokens (max_tokens) to 2000 to accommodate the input/output tokens. I set a relatively low temperature (0.2) so that the LLM does not get overly creative, which is not needed for this task. A high temperature can lead to hallucinations, where the LLM invents new information.
The LLM response is received in a JSON object and is further parsed and normalized as explained in the next section.
4. Parse the LLM output
As shown in the code above, the API response is received as a JSON object (parsed_json), which is further normalized using the following function.
def normalize_json_output(json_output):
    """
    Normalize the keys and convert list values to comma-separated strings.
    """
    normalized_output = {}
    for key, value in json_output.items():
        normalized_key = key.lower().replace(" ", "_")
        if isinstance(value, list):
            normalized_output[normalized_key] = ', '.join(value)
        else:
            normalized_output[normalized_key] = value
    return normalized_output
This function standardizes JSON object keys by converting them to lowercase and replacing spaces with underscores. It also converts list values to comma-separated strings to make the data easier to work with and parse.
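For illustration, here is how the function transforms a hypothetical LLM response; the values below are made up and only mirror the structure of the real output:

# Hypothetical LLM output before normalization
raw_output = {
    "Company name": "Example Oy",
    "AI field": ["natural language processing", "predictive modeling"],
    "Recommendations": ["Focus on data balance", "monitor for bias"]
}

print(normalize_json_output(raw_output))
# {'company_name': 'Example Oy',
#  'ai_field': 'natural language processing, predictive modeling',
#  'recommendations': 'Focus on data balance, monitor for bias'}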
The normalized JSON objects (json_outputs), which contain the key information extracted from all the documents, are finally saved to an Excel file.
def save_json_to_excel(json_outputs, output_file_path: str):
    """
    Save the list of JSON objects to an Excel file with a SNO. column.
    """
    # Convert the list of JSON objects to a DataFrame
    df = pd.DataFrame(json_outputs)

    # Add a Serial Number (SNO.) column
    df.insert(0, 'SNO.', range(1, len(df) + 1))

    # Ensure all columns are consistent and save the DataFrame to an Excel file
    df.to_excel(output_file_path, index=False)
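Putting the pieces together, the whole pipeline can be run with a few lines. The directory path, environment variable, and output file name below are placeholders:

# Illustrative end-to-end run (paths and key handling are placeholders)
api_key = os.environ["OPENAI_API_KEY"]  # assumes the API key is set in the environment
json_outputs = process_files("advisory_docs", api_key, system_message)
save_json_to_excel(json_outputs, "advisory_summary.xlsx")

Note that writing .xlsx files with pandas requires an Excel writer such as openpyxl to be installed.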
Below is a snapshot of the Excel file. The LLM-powered analysis produced accurate information for the required fields. The "N/A" entries in the snapshot represent data that is not available in the documents (older feedback templates did not include this information).