In this blog post, we demonstrate quick engineering techniques to generate accurate and relevant analyzes of tabular data using industry-specific language. This is done by providing in-context sample data from large language models (LLMs) with features and tags in the message. The results are similar to fine-tuning LLMs without the complexities of model fine-tuning. We use a method called Generative Tabular Learning (GTL) based on the white paper. From Supervised to Generative: A Novel Paradigm for Tabular Deep Learning with Large Language Models and demonstrate the benefits of GTL using fully managed JupyterLab notebooks in amazon SageMaker notebooks to interact with Meta Llama models hosted on amazon SageMaker or amazon Bedrock. Additional reference notebooks can be found at samples-aws to learn how to use the Llama de Meta models hosted on amazon Bedrock.
Prerequisites
The following sections describe the prerequisites necessary for this demo. You can implement these steps from the AWS Management Console or by using the latest version of the AWS Command Line Interface (AWS CLI).
- Access to LLM such as Llama de Meta models hosted on amazon SageMaker or amazon Bedrock
- Configured amazon SageMaker domain with JupyterLab notebooks and the necessary Python libraries and packages to interact with the LLMs.
- Sample tabular data sets from the financial industry formatted as structured data (we are using ETF data from Kaggle) available for query using a SQL engine such as amazon Athena.
- Knowledge of generative ai prompt engineering techniques to provide LLMs with relevant context and sample data.
- Ability to evaluate and compare results generated by LLM to determine their accuracy and relevance to the analysis task.
- Understanding of financial industry data and knowledge of how to organize and query this data in a structured tabular format consumable by LLM.
- Knowledge of the industry domain to which the data belongs to determine appropriate characteristics and labels for sample data requests.
Financial industry data
In the financial industry, data can be in the form of a table in PDF files or structured data in a database. The following is an example of a financial information data set for exchange-traded funds (ETFs). Kaggle in a structured tabular format that we use to test our solution.
A user can ask a business or industry-related question about ETFs.
NOTE: Since we used a SQL query engine to query the data set for this demo, the prompts and generated results mention SQL below.
# Business question
question = "Please provide a list of about 100 ETFs or ETNs names with exposure to US markets"
# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
question=question,
table_schema=table_schema_etf
)
results = get_llm_sql_analysis(
question=question,
sql_sys_prompt=SQL_SYS_PROMPT,
qna_sys_prompt=QNA_SYS_PROMPT
)
Once the data is retrieved from the dataset, it is sent to the LLM hosted on amazon Bedrock (see the list of supported models on amazon Bedrock) for analysis and generates a response to the user's question or query in natural language.
The example question above does not require much complex analysis of the data returned by the ETF data set. We receive a response from the LLM based on their analysis of the data in industry-satisfactory or business-relevant language:
LLM SQL Analysis:
After analyzing the provided SQL query results, I can conclude that the list of ETFs/ETNs does not primarily focus on US markets. Instead, it appears to be a comprehensive list of bond ETFs/ETNs with a global scope, covering various regions, currencies, and bond types.
Here are some key observations:
1. **Global coverage**: The list includes ETFs/ETNs tracking bond markets in Europe (e.g., Eurozone, UK), the US, and globally diversified indices.
2. **Bond types**: The list covers a range of bond types, including corporate bonds, government bonds, high-yield bonds, and green bonds.
3. **Currency exposure**: ETFs/ETNs are denominated in various currencies, such as EUR, USD, and GBP, with some offering hedged exposure to mitigate currency risks.
4. **ESG and SRI focus**: A significant portion of the list consists of ETFs/ETNs with an Environmental, Social, and Governance (ESG) or Socially Responsible Investing (SRI) focus, which suggests a emphasis on sustainable investing.
To answer the original question, I can identify a subset of ETFs/ETNs from the list that have exposure to US markets:
**US-focused ETFs/ETNs:**
1. xxxx USD Corporate Bond 0-3yr ESG UCITS ETF USD (Dist)
2. xxxx USD Corporate Bond ESG 0-3yr UCITS ETF EUR Hedged (Acc)
3. xxxx ESG USD High Yield (DR) UCITS ETF - Dist
4. xxxx USD High Yield Corporate Bond ESG UCITS ETF USD (Acc)
5. xxxx USD High Yield Corporate Bond ESG UCITS ETF USD (Dist)
6. xxxx Index US Corporate SRI UCITS ETF DR (C)
7. xxxx Index US Corporate SRI UCITS ETF DR Hedged EUR (D)
8. xxxx USD Corporate Bond ESG UCITS ETF (Acc)
9. xxxx USD Corporate Bond ESG UCITS ETF (Dist)
10. xxxx ESG USD High Yield Corporate Bond UCITS ETF 1C
11. xxxx ETF (LU) xxxx xxxx US Liquid Corporates Sustainable UCITS ETF (USD) A-dis
12. xxxx USD Corporate Green Bond UCITS ETF 2C Acc USD
Please note that this subset is not exhaustive, and there may be other ETFs/ETNs in the original list that have some exposure to US markets. Additionally, investors should carefully evaluate the investment objectives, risks, and characteristics of each ETF/ETN before making any investment decisions.
NOTE: The output ETF names do not represent the actual data in the dataset used in this demo.
NOTE: The results generated by LLMs are not deterministic and may vary in their tests.
What would be the LLM answer or data analysis when industry-specific natural language user questions become more complex? To answer questions that require more complex analysis of data with industry-specific context, the model would need more information than relying solely on its pre-trained knowledge.
Solution Overview
We recommend that you think about this question before you begin: Can you improve the context provided to the LLM in the message along with the user's natural language question to generate better results, before attempting to tune the LLMs, which requires setting up MLOPS processes and environments, collect and prepare relevant and accurate labeled data sets, and more?
We propose an intermediate GTL framework using the Meta Llama model on amazon Bedrock. The proposed framework is not intended to replace the adjustment option. The following diagram illustrates this GTL framework for LLM.
GTL is a type of few-shot prompting technique in which we provide the following information about the data retrieved from the structured data set as part of the request to the LLM:
- A personality that the LLM will use when generating data analysis (providing suggestions to the model to use industry-specific data that it has already been pre-trained on)
- Data characteristics and descriptions.
- Data labels and descriptions
- A small sample data set containing features
- A sample analysis as an example.
The following is an example of a GTL message:
instructions = (
{
"role": "user",
"content": """Given the following SQL query results: {query_results}
And the original question: {question}
You are an expert in Exchange-Traded Funds or ETFs and Exchange-Traded Notes or ETNs .
Based on the features of the funds or notes, please predict how expensive the funds are for investors.
I will supply multiple instances with features and the corresponding label for reference.
Please refer to the table below for detailed descriptions of the features and label:
— feature description —
Features:
isin: International Securities Identification Number
wkn: Wertpapierkennnummer or German securities identification number
name: ETF Name
fundprovider: Financial Company providing the ETF
legalstructure: Exchange Traded Fund (ETF) or Exchange Traded Notes (ETN)
totalexpenseratio: An expense ratio is the cost of owning an ETF or ETN, the management fee paid to the fund company for the benefit of owning the fund,
paid annually and measured as a percent of your investment in the fund. 0.30 percent means you’ll pay $30 per year for every $10,000 you have invested in the fund.
— label description —
Expensive: Whether the fund is expensive for investors or not. 0 means not expensive, 1 means expensive.
— data —
|isin|wkn|name|fundprovider|legalstructure|totalexpenseratio|Expensive|
|GB00BNRRxxxx |A3xxxx|xxxx Physical Staked Cardano|xxxx|ETN|0.0|0|
|BGPLWIG0xxxx|A2xxxx|xxxx Poland WIGxxx UCITS ETF|xxxx|ETF|0.0138|0|
|CH044568xxxx|A2Txxxx|xxxx crypto Basket Index ETP|xxxx|ETN|0.025|1|
|CH1114873xxxx|A3Gxxxx|xxxx Solana ETP|xxxx|ETN|0.025|1|
|GB00BNRRxxxx|A3xxxx|xxxx Physical Staked Algorand|xxxx|ETN|0.0||
Please use the supplied data to predict the . Fund is expensive(1) or not(0)?
Answer: 0
Please provide an analysis and interpretation of the results to answer the original {question}.
"""
}
)
QNA_SYS_PROMPT = format_instructions(instructions)
In our GTL message, we highlight only the subset of columns in the data set that are relevant to produce an accurate and relevant analysis of the data based on the industry of the 129 possible columns in the EFT data set that we have chosen and also examples how to interpret the data.
Examples of use cases
Let's look at some examples of messages with generated analysis. The next question requires a complex analysis of multi-column data from the ETF database based on industry knowledge. In this example, the user is trying to find ETFs (funds) that provide higher dividends and lower volatility in value, which are a desired fund characteristic that investors look for in the funds they want to invest in.
# Business question
question = "Name the least risk funds that yields higher dividends and isn't volatile"
# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
question=question,
table_schema=table_schema_etf
)
results = get_llm_sql_analysis(
question=question,
sql_sys_prompt=SQL_SYS_PROMPT,
qna_sys_prompt=QNA_SYS_PROMPT
)
The following is our response without the GTL message:
LLM SQL Query:
SELECT
name,
yeardividendyield,
yearvolatilitycur,
fiveyearreturncur
FROM
etftable
WHERE
yeardividendyield > 3
AND yearvolatilitycur < 10
AND fiveyearreturncur > 5
ORDER BY
yearvolatilitycur ASC
LIMIT 10;
LLM SQL Analysis:
Based on the provided data, I will analyze the features and predict the GoodInvestment label for each fund. Then, I will identify
the least risk funds that yield higher dividends and aren't volatile.
**Analysis**
From the data, I observe the following:
1. **Low total expense ratio**: Funds with lower totalexpenseratio are more desirable, as they charge lower management fees.
A lower expense ratio indicates that the fund is cheaper to own.
2. **High five-year return**: Funds with higher fiveyearreturncur indicate better performance over the past 5 years.
3. **Non-zero yearly dividend yield**: Funds with non-zero yeardividendyield provide a regular income stream to investors.
**Prediction of GoodInvestment label**
Based on the analysis, I predict the GoodInvestment label for each fund as follows:
|isin|GoodInvestment|
| --- | --- |
|LU17815xxxx|1|
|LU17815xxxx7|1|
|IE00B5Bxxxx|0|
|LU14078xxxx|1|
|IE00B4xxxx|1|
|LU17815xxxx|1|
**Least risk funds that yield higher dividends and aren't volatile**
To identify the least risk funds, I will consider the following criteria:
1. **Low volatility**: Funds with lower fiveyearreturncur volatility are less risky.
2. **Higher dividend yield**: Funds with higher yeardividendyield provide a regular income stream to investors.
3. **Low expense ratio**: Funds with lower totalexpenseratio are cheaper to own.
Based on these criteria, I recommend the following funds:
1. **xxxxUK NT (DR) UCITS ETF - Dist (LU17815xxxx)**: This fund has a low expense ratio (0.04%), a relatively high dividend yield (0.0379),
and a moderate five-year return (0.2296).
2. **xxxxUS Equity (DR) UCITS ETF - Dist (LU17815xxxx)**: This fund has a low expense ratio (0.04%), a relatively high dividend yield (0.0196),
and a high five-year return (0.7337).
3. **xxxxEURO STOXX 50 UCITS ETF EUR (IE00B4Kxxxx)**: This fund has a moderate expense ratio (0.05%), a relatively high dividend yield (0.0297),
and a moderate five-year return (0.3927).
These funds offer a balance between dividend yield, low expense ratio, and moderate volatility, making them suitable for investors seeking a
relatively stable investment with a regular income stream.
We see that the data retrieval only uses the annual volatility and not the 3 and 5 year volatility information. Additionally, some of the funds do not have volatility data in the data set (there are no values for 1, 3, or 5 year volatility).
The following is a modified question that asks for additional column considerations for 3 and 5 year data.
# Business question
question = "Name the least risk funds that yields higher dividends and isn't volatile based on five year, three year and one year volatiliy data"
# Generate a prompt to get the LLM to provide an SQL query
SQL_SYS_PROMPT = PromptTemplate.from_template(tmp_sql_sys_prompt).format(
question=question,
table_schema=table_schema_etf
)
results = get_llm_sql_analysis(
question=question,
sql_sys_prompt=SQL_SYS_PROMPT,
qna_sys_prompt=QNA_SYS_PROMPT
)
We use the following tagged GTL message to interpret 1 year, 3 year, and 5 year data or missing data:
instructions = (
{
"role": "user",
"content": """Given the following SQL query results: {query_results}
And the original question: {question}
You are an expert in Exchange-Traded Funds or ETFs and Exchange-Traded Notes or ETNs .
Based on the features of the funds or notes, please predict best funds for investors to invest in.
I will supply multiple instances with features and the corresponding label for reference.
Please refer to the table below for detailed descriptions of the features and label:
— feature description —
Features:
isin: International Securities Identification Number
wkn: Wertpapierkennnummer or German securities identification number
name: ETF Name
fundprovider: Financial Company providing the ETF
legalstructure: Exchange Traded Fund (ETF) or Exchange Traded Notes (ETN)
yeardividendyield: Yearly Dividend yield as a percentage of total investment
fiveyearreturncur: Returns over past 5 year period as a percentage of investment
totalexpenseratio: An expense ratio is the cost of owning an ETF or ETN, the management fee paid to the fund company for the benefit of owning the fund,
paid annually and measured as a percent of your investment in the fund. 0.30 percent means you’ll pay $30 per year for every $10,000 you have invested in the fund.
— label description —
volatile: The fund has low fiveyearvolatilitycur, threeyearvolatilitycur, yearvolatilitycur. 0 means not volatile, 1 means volatile, 2 means cannot be determined.
— data —
|isin|name|fiveyearvolatilitycur|threeyearvolatilitycur|yearvolatilitycur|Risk|
|LU033504xxxx|xxxx II EUR Overnight Rate Swap UCITS ETF 1D|8.0E-4|9.0E-4|0.0011|0|
|FR001051xxxx|Lyxor Euro Overnight Return UCITS ETF - Acc|8.0E-4|9.0E-4|0.0011|0|
|LU0290358xxxx|xxxxII EUR Overnight Rate Swap UCITS ETF 1C|9.0E-4|0.001|0.0013|0|
|BGCROEX03xxxx|xxxx UCITS ETF|0.675|0.8606|1.4787|1|
|IE000RN0xxxx|xxxxDisruptive technology Real Estate UCITS ETF Acc||||2|
|GB00BNRRxxxx|xxxxPhysical Staked xxxx||||2|
|FR001075xxxx|xxxx ETF xxxx0-6 Months Euro Investment Grade UCITS ETF EUR (C)|0.0014|0.0016|0.0021||
Please use the supplied data to predict the . Fund is volatile(1) or not(0) or cannot-be-determined(2)?
Answer: 1
Please provide an analysis and interpretation of the results to answer the original {question}.
"""
}
)
We see that with additional prompts the model uses all volatility columns in the data set (1 year, 3 years and 5 years) and provides output suggestions for when data is present or missing in the volatility columns.
The following is our response with GTL prompts:
LLM SQL Query:
SELECT
name,
fundprovider,
yeardividendyield,
fiveyearvolatilitycur,
threeyearvolatilitycur,
yearvolatilitycur
FROM
etftable
WHERE
yeardividendyield > 0
AND fiveyearvolatilitycur < (SELECT AVG(fiveyearvolatilitycur) FROM etftable)
AND threeyearvolatilitycur < (SELECT AVG(threeyearvolatilitycur) FROM etftable)
AND yearvolatilitycur < (SELECT AVG(yearvolatilitycur) FROM etftable)
ORDER BY
yeardividendyield DESC,
fiveyearvolatilitycur,
threeyearvolatilitycur,
yearvolatilitycur
LIMIT 10;
LLM SQL Analysis:
Based on the provided data, I will analyze and interpret the results to answer the original question.
**Least Risk Funds with Higher Dividend Yields and Low Volatility**
To identify the least risk funds, I will consider the following criteria:
1. **Low Volatility**: Funds with low five-year, three-year, and one-year volatility (less than 0.1).
2. **Higher Dividend Yields**: Funds with a higher yearly dividend yield (greater than 0.07).
3. **Low Expense Ratio**: Funds with a low total expense ratio (less than 0.1).
After analyzing the data, I have identified the following funds that meet the above criteria:
1. **xxxx Global Select Dividend 100 Swap UCITS ETF 1D**:
* Yearly Dividend Yield: 0.0938
* Five-year Volatility: 0.1096
* Three-year Volatility: 0.1125
* One-year Volatility: 0.1236
* Expense Ratio: Not available in the data
2. **xxxx Emerging Markets Equity Income UCITS ETF**:
* Yearly Dividend Yield: 0.0797
* Five-year Volatility: 0.1532
* Three-year Volatility: 0.1471
* One-year Volatility: 0.1335
* Expense Ratio: Not available in the data
3. **xxxx FTSE EM High Dividend Low Volatility UCITS ETF**:
* Yearly Dividend Yield: 0.0787
* Five-year Volatility: 0.1566
* Three-year Volatility: 0.1621
* One-year Volatility: 0.1439
* Expense Ratio: Not available in the data
These funds have relatively low volatility and higher dividend yields compared to the other funds in the dataset. However, please note that the
expense ratio is not available in the data, which is an important factor to consider when evaluating the overall cost of investing in an ETF.
**Additional Considerations**
When evaluating these funds, investors should also consider other factors such as:
1. **Fund Provider**: The reputation and experience of the fund provider can impact the fund's performance and risk profile.
2. **Legal Structure**: The legal structure of the fund (ETF or ETN) can affect its tax efficiency, liquidity, and regulatory environment.
3. **Investment Strategy**: The fund's investment strategy and asset allocation can influence its risk profile and potential returns.
In conclusion, the three funds mentioned above appear to be the least risk funds with higher dividend yields and low volatility based
on the provided data. However, investors should conduct further research and consider additional factors before making an investment decision.
As we can see, data recovery is more precise. Additionally, the generated analysis considered all volatility information in the data set (1 year, 3 years, and 5 years) and took into account any present or missing volatility data.
From this result, it is recommended to construct a Selected set of GTL messages along with the most common user questions belonging to data sets that users will ask about. Prompts will need to be created by dataset specialists who have a deep understanding of the dataset from an industry perspective and can provide appropriate context to the LLMs. Organizations can use such a rapid library to create interactive applications that enable typical business users who may not have a deep knowledge or understanding of the underlying data sets to interact with and gain insights from these data sets using natural language questions. .
Conclusion
As newer and larger LLMs are released, they get better at generating analysis of structured data sets using industry-specific language. However, there is room for improvement in analyzing data from structured data sets. One option is to fine-tune the LLM to improve the relevance and language of data analysis generated using a specific business language. Tuning requires additional efforts and costs (collection of relevant data, labeling of the data, additional costs involved in procurement and provisioning, and maintenance of the tuning computing environment).
In this post, we show a method with brief directions using Meta Llama models available through amazon Bedrock that can improve industry or business-specific data analysis with just quick engineering. (For certain use cases, adjustments may be necessary. See amazon Bedrock pricing for estimated costs with or without the use of adjusted models.)
Try this solution with your own industry-specific use cases and data sets and let us know your feedback and questions in the comments.
NOTE: The blog authors do not provide any financial or investment advice in this blog post, nor do they recommend this data set or the ETFs mentioned in this data set.
About the authors
Randy DeFauw He is a Senior Principal Solutions Architect at AWS. He has an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also has an MBA from Colorado State University. Randy has held various positions in technology, from software engineering to product management. He entered the Big Data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences including Strata and GlueCon.
Arghya Banerjee is a Senior Solutions Architect at AWS in the San Francisco Bay Area focused on helping customers adopt and use the AWS Cloud. It is focused on Big Data services, Data Lakes, Streaming and batch analysis and generative ai technologies.
Ravi Ganesha is a Senior Solutions Architect at AWS in the Austin, Texas area, focused on helping customers address their business problems through cloud adoption. It focuses on analytics, resilience, security and generative artificial intelligence technologies.
Varun Mehta He is a Senior Solutions Architect at AWS. He is passionate about helping customers build well-designed, enterprise-scale solutions in the AWS cloud. Work with strategic clients using ai/ML to solve complex business problems. Outside of work, he loves spending time with his wife and children.