Case Study: Unstructured Location Descriptions of Car Accidents
Data Collection and Preparation
To test and quantify the geocoding capabilities of LLMs, 100 unstructured location descriptions of vehicle accidents in Minnesota were randomly selected from a dataset scraped from the web. The ground truth coordinates for all 100 accidents were created manually using various mapping applications, including Google Maps and the Minnesota Department of Transportation's Traffic Mapping Application (TMA).
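For illustration, the sampling step could be reproduced with pandas; the file and column names below are hypothetical, not taken from the original analysis.

```python
import pandas as pd

# Load the scraped accident records (file name is hypothetical).
accidents = pd.read_csv("mn_accidents_scraped.csv")

# Randomly select 100 descriptions; a fixed seed makes the sample reproducible.
sample = accidents.sample(n=100, random_state=42).reset_index(drop=True)
```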
Some sample location descriptions are featured below.
- US Hwy 71 at MN Hwy 60, WINDOM, Cottonwood County
- EB Highway 10 near Joplin St NW, ELK RIVER, Sherburne County
- EB I 90 / HWY 22, FOSTER TWP, Faribault County
- Highway 75 milepost 403, SAINT VINCENT TWP, Kittson County
- 65 Highway / King Road, BRUNSWICK TWP, Kanabec County
As seen in the examples above, there is a wide variety of ways a description can be structured, as well as what defines the location. One example is the fourth description, which locates the accident by mile marker. A mile marker is far less likely to be matched in a typical geocoding process, since that information usually isn't included in reference data. Finding the ground truth coordinates for descriptions like this one relied heavily on the Minnesota Department of Transportation's Linear Referencing System (LRS), which provides a standardized approach for measuring roads throughout the state, in which mile markers play a vital role. This data can be accessed through the TMA application mentioned previously.
Methodology & Geocoding Strategies
After preparing the data, five separate notebooks were set up to test different geocoding processes. Their configurations are as follows:
1. Google Geocoding API, used on the raw location description
2. Esri Geocoding API, used on the raw location description
3. Google Geocoding API, used on an OpenAI GPT 3.5 standardized location description
4. Esri Geocoding API, used on an OpenAI GPT 3.5 standardized location description
5. OpenAI GPT 3.5, used as a geocoder itself
To summarize, the Google and Esri geocoding APIs were used on both the raw descriptions and on descriptions standardized by passing a short prompt into the OpenAI GPT 3.5 model. The Python code for this standardization process is shown below.
```python
import openai


def standardize_location(df, description_series):
    # Standardize each raw description with GPT 3.5.
    df["ai_location_description"] = df[description_series].apply(_gpt_chat)
    return df


def _gpt_chat(input_text):
    prompt = """Standardize the following location description into text
    that could be fed into a Geocoding API. When responding, only
    return the output text."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": input_text},
        ],
        temperature=0.7,
        n=1,
        max_tokens=150,
        stop=None,
    )
    # Keep only the last line of the reply, in case the model adds extra text.
    return response.choices[0].message.content.strip().split("\n")[-1]
```
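For context, a hypothetical call on a DataFrame with a `location_description` column would look like this:

```python
df = standardize_location(df, "location_description")
print(df["ai_location_description"].head())
```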
The four test cases that relied on a geocoding API used the code below to make requests to their respective geocoders and return the resulting coordinates for all 100 descriptions.
```python
import os

import pandas as pd
import requests


# Esri Geocoder
def geocode_esri(df, description_series):
    df["xy"] = df[description_series].apply(_single_esri_geocode)
    df["x"] = df["xy"].apply(lambda row: row.split(",")[0].strip())
    df["y"] = df["xy"].apply(lambda row: row.split(",")[1].strip())
    df["x"] = pd.to_numeric(df["x"], errors="coerce")
    df["y"] = pd.to_numeric(df["y"], errors="coerce")
    # Drop rows that failed to geocode.
    df = df[df["x"].notna()]
    df = df[df["y"].notna()]
    return df


def _single_esri_geocode(input_text):
    base_url = "https://geocode-api.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates"
    params = {
        "f": "json",
        "singleLine": input_text,
        "maxLocations": "1",
        "token": os.environ["GEOCODE_TOKEN"],
    }
    response = requests.get(base_url, params=params)
    data = response.json()
    try:
        x = data["candidates"][0]["location"]["x"]
        y = data["candidates"][0]["location"]["y"]
    except (KeyError, IndexError):
        # No candidates returned for this description.
        x = None
        y = None
    return f"{x}, {y}"
```
```python
# Google Geocoder
def geocode_google(df, description_series):
    df["xy"] = df[description_series].apply(_single_google_geocode)
    df["x"] = df["xy"].apply(lambda row: row.split(",")[0].strip())
    df["y"] = df["xy"].apply(lambda row: row.split(",")[1].strip())
    df["x"] = pd.to_numeric(df["x"], errors="coerce")
    df["y"] = pd.to_numeric(df["y"], errors="coerce")
    # Drop rows that failed to geocode.
    df = df[df["x"].notna()]
    df = df[df["y"].notna()]
    return df


def _single_google_geocode(input_text):
    base_url = "https://maps.googleapis.com/maps/api/geocode/json"
    params = {
        "address": input_text,
        "key": os.environ["GOOGLE_MAPS_KEY"],
        # Bias results toward Minnesota; format is "south,west|north,east".
        "bounds": "43.00,-97.50|49.5,-89.00",
    }
    response = requests.get(base_url, params=params)
    data = response.json()
    try:
        x = data["results"][0]["geometry"]["location"]["lng"]
        y = data["results"][0]["geometry"]["location"]["lat"]
    except (KeyError, IndexError):
        # No results returned for this description.
        x = None
        y = None
    return f"{x}, {y}"
```
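Each of the four API test cases then reduces to a single call against either the raw or the standardized column; a hypothetical invocation (column names assumed) might look like:

```python
# Raw descriptions
esri_raw = geocode_esri(df.copy(), "location_description")
google_raw = geocode_google(df.copy(), "location_description")

# GPT 3.5 standardized descriptions
esri_std = geocode_esri(df.copy(), "ai_location_description")
google_std = geocode_google(df.copy(), "ai_location_description")
```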
Additionally, one final process tested GPT 3.5 as the geocoder itself, without the help of any geocoding API. The code for this process was nearly identical to the standardization code above, but used a different prompt, shown below.
Geocode the following address. Return a latitude (Y) and longitude (X) as accurately as possible. When responding, only return the output text in the following format: X, Y
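A minimal sketch of that variant is shown below, reusing the imports from earlier. It mirrors `_gpt_chat` with the geocoding prompt swapped in; the parsing of the "X, Y" reply into numeric columns is my assumption, since the original code isn't reproduced in full here.

```python
def geocode_gpt(df, description_series):
    df["xy"] = df[description_series].apply(_gpt_geocode)
    # Parse the "X, Y" reply into numeric columns, dropping failures.
    df["x"] = pd.to_numeric(df["xy"].str.split(",").str[0], errors="coerce")
    df["y"] = pd.to_numeric(df["xy"].str.split(",").str[1], errors="coerce")
    return df[df["x"].notna() & df["y"].notna()]


def _gpt_geocode(input_text):
    prompt = """Geocode the following address. Return a latitude (Y) and
    longitude (X) as accurately as possible. When responding, only return
    the output text in the following format: X, Y"""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": input_text},
        ],
        temperature=0.7,
        n=1,
        max_tokens=150,
        stop=None,
    )
    return response.choices[0].message.content.strip().split("\n")[-1]
```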
Performance Metrics and Insights
After the various processes were developed, each was run, and several performance metrics were calculated, covering both execution time and geocoding accuracy. These metrics are listed below.
| Geocoding Process | Mean | StdDev | MAE | RMSE |
| ------------------- | ------ | ------ | ------ | ------ |
| Google with GPT 3.5 | 0.1012 | 1.8537 | 0.3698 | 1.8565 |
| Google with Raw | 0.1047 | 1.1383 | 0.2643 | 1.1431 |
| Esri with GPT 3.5 | 0.0116 | 0.5748 | 0.0736 | 0.5749 |
| Esri with Raw | 0.0001 | 0.0396 | 0.0174 | 0.0396 |
| GPT 3.5 Geocoding | 2.1261 | 80.022 | 45.416 | 80.050 |
| Geocoding Process | 75% ET | 90% ET | 95% ET | Run Time |
| ------------------- | ------ | ------ | ------ | -------- |
| Google with GPT 3.5 | 0.0683 | 0.3593 | 3.3496 | 1m 59.9s |
| Google with Raw | 0.0849 | 0.4171 | 3.3496 | 0m 23.2s |
| Esri with GPT 3.5 | 0.0364 | 0.0641 | 0.1171 | 2m 22.7s |
| Esri with Raw | 0.0362 | 0.0586 | 0.1171 | 0m 51.0s |
| GPT 3.5 Geocoding | 195.54 | 197.86 | 199.13 | 1m 11.9s |
The metrics are explained in more detail here:

- **Mean**: the mean error, where error is the total of the X and Y differences from the ground truth (a Manhattan distance, in decimal degrees).
- **StdDev**: the standard deviation of that error (Manhattan distance, in decimal degrees).
- **MAE**: the mean absolute error (Manhattan distance, in decimal degrees).
- **RMSE**: the root mean square error (Manhattan distance, in decimal degrees).
- **75%/90%/95% ET**: the error threshold for the given percentage (Euclidean distance, in decimal degrees), meaning that percentage of records falls within the resulting value's distance from the ground truth.
- **Run Time**: the total time taken to run the geocoding process on all 100 records.
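To make those definitions concrete, here is a sketch of how the tables' metrics could be computed, assuming columns `x`/`y` hold the predictions and `x_true`/`y_true` the ground truth (an illustration, not necessarily the code used in the analysis):

```python
import numpy as np
import pandas as pd


def error_metrics(df: pd.DataFrame) -> dict:
    # Signed "Manhattan" error: total of the X and Y differences from
    # the ground truth, in decimal degrees.
    err = (df["x"] - df["x_true"]) + (df["y"] - df["y_true"])

    # Euclidean distance from the ground truth, used for the thresholds.
    dist = np.hypot(df["x"] - df["x_true"], df["y"] - df["y_true"])

    return {
        "Mean": err.mean(),
        "StdDev": err.std(),
        "MAE": err.abs().mean(),
        "RMSE": np.sqrt((err**2).mean()),
        "75% ET": dist.quantile(0.75),
        "90% ET": dist.quantile(0.90),
        "95% ET": dist.quantile(0.95),
    }
```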
Clearly, GPT 3.5 performs far worse on its own. That said, if a couple of outliers are removed (records the model placed on other continents), the remaining results don't look too out of place, at least visually.
It is also interesting that the LLM standardization step actually decreased accuracy, which I personally found a bit surprising, since my whole intention in introducing that component was to improve the overall accuracy of the geocoding process. It is worth noting that the prompts themselves could have been part of the problem here, and the role of "prompt engineering" in geospatial contexts deserves further exploration.
The last main takeaway from this analysis is the difference in execution times: any process that includes GPT 3.5 runs significantly slower. Esri's geocoding API is also slower than Google's in this setting. Rigorous benchmarking was not performed, however, so these results should be interpreted with that in mind.