OpenAI recently announced support for Structured Outputs in its latest gpt-4o-2024-08-06 model. Structured outputs in relation to large language models (LLMs) are nothing new: developers have long used various prompt engineering techniques or third-party tools to obtain them.
In this article, we'll explain what structured outputs are, how they work, and how you can apply them in your own LLM-based applications. While OpenAI's announcement makes them fairly easy to implement using their APIs (as we'll demonstrate here), you may instead want to consider the open-source Outlines package (maintained by the lovely people at dottxt), since it can be applied both to self-hosted open-weight models (e.g. Mistral and LLaMA) and to proprietary APIs. (Disclaimer: due to this issue, at the time of writing Outlines does not support structured JSON generation via the OpenAI APIs, but that will change soon!)
Taking the RedPajama dataset as an example, the vast majority of pre-training data is human text. Therefore, "natural language" is the native domain of LLMs, in both input and output. When we build applications, however, we would like to use formal structures or machine-readable schemas to encapsulate our data input/output. In this way, we build robustness and determinism into our applications.
Structured outputs are a mechanism by which we enforce a predefined schema on the LLM output. This typically means enforcing a JSON schema; however, it is not limited to JSON: in principle we could enforce XML, Markdown, or a completely custom schema. The benefits of structured outputs are twofold:
- Simpler prompt design – no need to be overly detailed when specifying what the output should look like
- Deterministic names and types – we can guarantee to obtain, for example, an attribute age with a number JSON type in the LLM's response, as sketched below
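As an illustration of the second point, a minimal (hypothetical) schema fragment guaranteeing an age attribute with a number JSON type might look like this:

# Hypothetical fragment for illustration only: the response must
# contain a numeric "age" attribute - no prompt-side pleading required.
age_fragment = {
    "type": "object",
    "properties": {
        "age": {"type": "number", "description": "Age in years"}
    },
    "required": ["age"],
    "additionalProperties": False,
}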
For this example, we will use the first sentence of the Wikipedia entry for Sam Altman…
Samuel Harris Altman (born April 22, 1985) is an American entrepreneur and investor best known as the CEO of OpenAI since 2019 (he was briefly fired and reinstated in November 2023).
…and we will use the latest GPT-4o checkpoint as the named entity recognition (NER) system. We will apply the following JSON schema:
json_schema = {
    "name": "NamedEntities",
    "schema": {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "description": "List of entity names and their corresponding types",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {
                            "type": "string",
                            "description": "The actual name as specified in the text, e.g. a person's name, or the name of the country"
                        },
                        "type": {
                            "type": "string",
                            "description": "The entity type, such as 'Person' or 'Organization'",
                            "enum": ["Person", "Organization", "Location", "DateTime"]
                        }
                    },
                    "required": ["name", "type"],
                    "additionalProperties": False
                }
            }
        },
        "required": ["entities"],
        "additionalProperties": False
    },
    "strict": True
}
In essence, our LLM response should contain a NamedEntities object, consisting of an array of entities, each containing a name and a type.
There are a few things to keep in mind here. For example, we can enforce an enum type, which is very useful in NER since we can restrict the output to a fixed set of entity types. We must specify all fields in the required array; however, we can also emulate "optional" fields by setting the type to e.g. ["string", "null"].
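For instance, a hypothetical optional nickname property (not part of our NER schema) could be declared like this:

# "nickname" must still appear in the response, but may be null,
# which emulates an optional field under strict mode.
nickname_property = {
    "nickname": {
        "type": ["string", "null"],
        "description": "The person's nickname, or null if none is mentioned"
    }
}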
Now we can pass our schema, along with the data and instructions, to the API. We need to populate the response_format argument with a dict where we set type to "json_schema" and then supply the corresponding schema.
from openai import OpenAI

client = OpenAI()

# first sentence of the Wikipedia entry for Sam Altman
s = ("Samuel Harris Altman (born April 22, 1985) is an American entrepreneur "
     "and investor best known as the CEO of OpenAI since 2019 (he was briefly "
     "fired and reinstated in November 2023).")

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": """You are a Named Entity Recognition (NER) assistant.
            Your job is to identify and return all entity names and their
            types for a given piece of text. You are to strictly conform
            only to the following entity types: Person, Location, Organization
            and DateTime. If uncertain about entity type, please ignore it.
            Be careful of certain acronyms, such as role titles "CEO", "CTO",
            "VP", etc - these are to be ignored.""",
        },
        {
            "role": "user",
            "content": s
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": json_schema,
    }
)
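Once the call returns, we can recover the structured result from the message content. A minimal sketch, assuming the content is a JSON string (the SDK also exposes a refusal field in case the model declines the request):

import json

message = completion.choices[0].message
if message.refusal:
    # the model may refuse to answer; surface it instead of parsing
    raise RuntimeError(message.refusal)

result = json.loads(message.content)
print(result)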
The result should look like this:
{'entities': [{'name': 'Samuel Harris Altman', 'type': 'Person'},
              {'name': 'April 22, 1985', 'type': 'DateTime'},
              {'name': 'American', 'type': 'Location'},
              {'name': 'OpenAI', 'type': 'Organization'},
              {'name': '2019', 'type': 'DateTime'},
              {'name': 'November 2023', 'type': 'DateTime'}]}
The complete source code used in this article is available here.
The magic lies in the combination of constrained sampling and context-free grammars (CFGs). We have already mentioned that the vast majority of pre-training data is "natural language". Statistically, this means that at each decoding or sampling step, there is a non-negligible chance of sampling an arbitrary token from the learned vocabulary (and in modern LLMs, vocabularies typically span more than 40,000 tokens). However, when working with formal schemas, we would like to eliminate all improbable tokens very quickly.
In the previous example, if we have already generated…

{'entities': [{'name': 'Samuel Harris Altman',

…then, ideally, we would place a very high logit bias on the "typ" token (the beginning of the next "type" key) in the next decoding step, and a very low probability on all other tokens in the vocabulary.
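To build some intuition, here is a toy sketch of that masking step (not OpenAI's actual implementation; it assumes we somehow already know which token ids the grammar allows next):

import numpy as np

def constrain_logits(logits: np.ndarray, allowed_ids: set[int]) -> np.ndarray:
    """Send every grammar-invalid logit to -inf, so that after the
    softmax only grammar-allowed tokens can be sampled."""
    masked = np.full_like(logits, -np.inf)
    for token_id in allowed_ids:
        masked[token_id] = logits[token_id]
    return masked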
This is essentially what happens: when we provide the schema, it is converted into a formal grammar, i.e. a CFG, which guides the logit bias values during decoding. CFGs are one of those old-school natural language processing (NLP) and computer science mechanisms that are coming back into fashion. A very good introduction to CFGs is given in this StackOverflow answer, but essentially a CFG is a way of describing transformation rules for a collection of symbols.
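For a flavour of what such rules look like, here is a hypothetical, heavily simplified grammar for a flat JSON object, expressed as Python data (each non-terminal on the left can be rewritten as one of the symbol sequences on the right):

# Heavily simplified, illustrative CFG for a flat JSON object.
grammar = {
    "OBJECT": [["{", "PAIRS", "}"]],
    "PAIRS":  [["PAIR"], ["PAIR", ",", "PAIRS"]],
    "PAIR":   [["STRING", ":", "VALUE"]],
    "VALUE":  [["STRING"], ["NUMBER"]],
}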
Structured outputs are nothing new, but they are certainly becoming a priority in proprietary APIs and LLM services. They provide a bridge between the erratic and unpredictable "natural language" domain of LLMs and the deterministic and structured domain of software engineering. Structured outputs are essentially a must for anyone designing complex LLM applications where LLM outputs need to be shared or passed between multiple components. While native API support has finally arrived, developers should also consider libraries such as Outlines, as they provide an LLM/API-agnostic way of handling structured output.