In this post, I will present a paradigm recently developed at Anaplan for extracting temporal information from natural language text, as part of an NLQ (natural language query) project. While I will focus on time extraction, the paradigm is versatile and applicable to analyzing other unstructured texts and extracting different patterns of information. This includes named entity recognition, text-to-SQL conversion, quantity extraction, and more.
The core of the paradigm lies in the construction of a flexible pipeline, which makes it easy to fit a model to extract meaning from any conceivable expression in the language. It is based on a deep learning model (transformers), yet for us it achieved 99.98% accuracy, which is relatively rare for machine learning methods. Additionally, it does not use LLMs (large language models); in fact, it requires only a minimal transformer model. The result is a compact, adaptable machine learning model that exhibits the accuracy of rule-based systems.
For those looking to extract times, numerical values, or phone numbers, Facebook's Duckling package offers a rule-based solution. However, if Duckling doesn't meet your requirements or you're eager to explore a new machine learning paradigm, read on.
Can LLMs capture the meaning?
LLMs, despite their capabilities, face challenges in analyzing these types of phrases and extracting their meaning comprehensively. Consider the expression “the first 15 weeks of last year.” Converting this to a date range requires the model to determine the current year, subtract one, and calculate the position of the 15th week as it adjusts for leap years. Language models were not built for this type of computation.
In my experience, LLMs can accurately generate the correct date range 90% to 95% of the time, but struggle with the remaining 5% to 10%, no matter what prompting techniques you use. Not to mention: LLMs are resource intensive and slow.
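For contrast, the same date range is only a few lines of deterministic code. Here is a minimal sketch, assuming ISO-style weeks that start on Monday (the exact week convention is a business decision, not something a model should compute):

from datetime import date, timedelta

def first_n_weeks_of_last_year(today: date, n: int = 15) -> tuple[date, date]:
    # Deterministic calculation: no language model needed for this part.
    year = today.year - 1
    start = date(year, 1, 1)
    # Advance to the first Monday of the year (one possible week convention).
    start += timedelta(days=(7 - start.weekday()) % 7)
    end = start + timedelta(weeks=n) - timedelta(days=1)
    return start, end

print(first_n_weeks_of_last_year(date(2024, 6, 1)))
# (datetime.date(2023, 1, 2), datetime.date(2023, 4, 16))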
Fortunately, by following three principles, compact transformers can accomplish the task successfully.
- Separate information extraction from logical deduction.
- Automatically generate a data set using structured patterns.
- Limit generative AI to the required structure.
In this post, I'll cover the first two, since I covered the third one in a previous post.
Separate information extraction from logical deduction
The first principle is to ensure that the function of the language model is to extract information from free text, rather than to make logical deductions: logical deductions can be easily implemented in code.
Consider the sentence: “How many movies were released two years ago?” The language model's task should be to identify that the relevant year is this_year - 2, without calculating the actual year (meaning it doesn't need to know the current year). Its objective is to analyze the meaning and structure of the unstructured language. Once that formula is extracted, we can implement its calculation in code.
To make this work, we introduce a Structured Time Language (STL) capable of expressing time elements. For example, “in 2020” translates to “TIME.year==2020” and “in three months” becomes “NOW.month==3”. While not all of the STL language is detailed here, it should be relatively intuitive: you can reference attributes like year, quarter, and month, either as absolute values or relative to NOW. The translation of “the last 12 weeks of last year” is “NOW.year==-1 AND TIME.week>=-12”.
By removing any logical deduction or calculation from the task, we take a large load off the language model and allow it to focus on extracting information. This division of labor will significantly improve your accuracy. Once the translation process is complete, it is easy to develop code for a parser that reads the structured language and retrieves the necessary date range.
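To illustrate how simple that deterministic side can be, here is a minimal sketch of a resolver for a single year clause. The function name and the handling of relative values are my own assumptions; a real parser would also cover months, weeks, quarters, literal dates, and AND-combined clauses.

import re
from datetime import date

def resolve_year_clause(clause: str, today: date) -> tuple[date, date]:
    # Turn "TIME.year >= 2020" or "NOW.year == -1" into a concrete date range.
    m = re.fullmatch(r"(TIME|NOW)\.year\s*(==|>=|<=)\s*(-?\d+)", clause.strip())
    if m is None:
        raise ValueError(f"clause not covered by this sketch: {clause}")
    ref, op, value = m.group(1), m.group(2), int(m.group(3))
    year = today.year + value if ref == "NOW" else value   # NOW.* is relative to today
    if op == "==":
        return date(year, 1, 1), date(year, 12, 31)
    if op == ">=":
        return date(year, 1, 1), today
    return date.min, date(year, 12, 31)                     # "<=" case

print(resolve_year_clause("TIME.year >= 2020", today=date(2024, 6, 1)))
# (datetime.date(2020, 1, 1), datetime.date(2024, 6, 1))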
Since this is a translation task, from natural language to STL, we use an encoder-decoder transformer. We use the Hugging Face BART model, which can easily be fine-tuned for this task.
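A rough sketch of what the fine-tuning setup can look like with a recent version of the transformers library (the checkpoint name and the single-example batch are only illustrative, not our exact configuration):

from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Each training example is a (question, STL) pair produced by the data pipeline below.
batch = tokenizer(
    ["Since 2017, who has won the most Oscars?"],
    text_target=["TIME.year>=2017"],
    return_tensors="pt",
    padding=True,
)
loss = model(**batch).loss   # standard seq2seq cross-entropy
loss.backward()              # plug into your usual training loop or Trainer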
But how do we get the data to train the model?
Automatically generate a data set using structured patterns
Since there is no training data set for this translation task, we must generate it ourselves. This was done by following these steps:
Step one: Write functions to map date and time objects to both “natural language” and STL formats:
# Each function maps a datetime object to a (free text, STL) pair.
def since_year(datetime):
    free_text = f"since {datetime.year}"
    answer = f"TIME.year >= {datetime.year}"
    return free_text, answer

def half_literal(datetime):
    free_text = datetime.strftime("%-d, %B %Y")
    answer = f"TIME.date >= {datetime}"
    return free_text, answer

def until_quarter_year(datetime):
    q = (datetime.month - 1) // 3 + 1   # quarter of the sampled date
    free_text = f"until Q{q}-{datetime.year}"
    answer = f"TIME.year=={datetime.year} AND TIME.quarter=={q}"
    return free_text, answer
Given a datetime object, these functions return a tuple of free text and its corresponding STL, for example: “since 2020”, “TIME.year >= 2020”.
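For instance, calling two of the functions above:

from datetime import datetime

print(since_year(datetime(2020, 5, 3)))
# ('since 2020', 'TIME.year >= 2020')
print(until_quarter_year(datetime(2019, 5, 3)))
# ('until Q2-2019', 'TIME.year==2019 AND TIME.quarter==2')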
Step two: Sample a random function and a random date within a specific range:
date = np.random.choice(pd.date_range('1970/1/1', '2040/12/31'))
Now insert the datetime into the function.
Step three: Add the free text to a random question (we can easily generate questions randomly or extract them from an existing question data set; their quality and meaning are not very important).
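Putting the three steps together, a sketch of the generator might look like this (the function list and question templates here are made up for illustration):

import numpy as np
import pandas as pd

pattern_funcs = [since_year, until_quarter_year]           # one function per pattern
question_templates = [
    "What was the revenue {time}?",
    "{time}, who won the most Oscars?",
]

def generate_example():
    func = np.random.choice(pattern_funcs)                  # step two: random pattern
    date = pd.Timestamp(np.random.choice(pd.date_range('1970/1/1', '2040/12/31')))
    free_text, stl = func(date)                             # step one's functions
    question = np.random.choice(question_templates).format(time=free_text)
    return question, stl                                    # step three: question + label

dataset = [generate_example() for _ in range(10_000)]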
With this pipeline, we can quickly generate thousands of text-STL pairs, for example:
- “What was the GDP growth in the second quarter of 2019?”, “TIME.quarter==2 AND TIME.year==2019”
- “Since 2017, who has won the most Oscars?”, “TIME.year>=2017”
- “Who was the president on May 3, 2020?”, “TIME.date==2020/05/03”
This approach ensures flexibility to add new patterns effortlessly. If you encounter a time expression that is not covered by one of these functions (for example, “In N years”), you can write a function that generates examples for this pattern in seconds.
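For example, such a function could look like this, following the “in three months” → “NOW.month==3” convention shown earlier (the exact STL form is an assumption):

import random

def in_n_years(datetime):
    # New pattern: "in N years", expressed relative to NOW.
    # The datetime argument is unused here but keeps the shared signature.
    n = random.randint(1, 10)
    free_text = f"in {n} years"
    answer = f"NOW.year=={n}"
    return free_text, answer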
In practice, we can further optimize the efficiency of the code. Instead of separate functions for each pattern like “since 2020” and “until 2020”, we can randomly sample connective words like “from”, “to”, “in”, etc. This initial batch of features may take some time to develop, but you can quickly scale to hundreds of patterns. Subsequently, addressing any missing expressions becomes trivial, since the process is already established. With a few iterations, almost all relevant expressions can be covered.
Also, we don't need to cover all the expressions: since the transformer model we use is pre-trained on a huge corpus of text, it will generalize from the provided patterns to new ones.
Finally, we can use an LLM to generate more examples. Just ask an LLM:
Hey, what's another way to write "What was the revenue until Aug 23"
And it may return:
"How much did we make before August 2023".
This data augmentation process can also be automated by sending numerous examples to an LLM, adding variety to our data set. Since the role of the LLM is solely data set creation, considerations of cost and speed become inconsequential.
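A sketch of that automation with the OpenAI client (any LLM provider works; the model name and prompt are illustrative):

from openai import OpenAI

client = OpenAI()

def paraphrase(question: str) -> str:
    # Ask an LLM to rephrase the question; the STL label stays unchanged.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f'Rewrite this question in a different way: "{question}". '
                       f'Answer with the rephrased question only.',
        }],
    )
    return response.choices[0].message.content.strip()

# dataset: the (question, STL) pairs generated earlier
augmented = [(paraphrase(question), stl) for question, stl in dataset]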
By combining the flexibility of adding new patterns, generalization of the pre-trained model, and data augmentation using an LLM, we can effectively cover almost any expression.
The final principle of this paradigm is to restrict generative AI to produce only STL queries, ensuring compliance with the required structure. The method to achieve this, as well as a way to optimize the tokenization process, was discussed in a previous post.
By adhering to these three principles, we achieved an impressive accuracy of 99.98% on our test data set. Additionally, this paradigm gave us the flexibility to quickly address new and unsupported time expressions.
Summary
Large language models (LLMs) are not always the optimal solution for linguistic tasks. With the right approach, shallower transformer models can efficiently extract information from natural language with high accuracy and flexibility, at reduced time and cost.
The key principles to remember are:
- Focus the model solely on information extraction, avoiding complex logical deductions. This may require generating a mediator language and implementing a parser and logical deduction in the code.
- Establish a pipeline to generate a data set and train a model, so that adding new features (new language patterns) is easy and fast. This process may include the use of an LLM, adding more variety to the data set.
- Limit the model's generation to the constraints of a structured language.
While this post focused on extracting time elements, the paradigm applies to extracting any free text information and structuring it into various formats. With this paradigm, the precision of a rules-based engine can be achieved, with the flexibility of a machine learning model.