Author's image | DALLE-3 and Canva
If you've ever had the opportunity to work with data, you've probably encountered the need to load JSON (short for JavaScript Object Notation) files into a Pandas DataFrame for further analysis. JSON files store data in a format that is clear for people to read and also easy for computers to understand. However, JSON files can sometimes be difficult to navigate. Therefore, we load them in a more structured format like DataFrames, which is set up like a spreadsheet with rows and columns.
I will show you two different ways to convert JSON data into a Pandas DataFrame. Before we discuss these methods, let's assume this dummy nested JSON file that I'll use as an example throughout this article.
{
"books": (
{
"title": "One Hundred Years of Solitude",
"author": "Gabriel Garcia Marquez",
"reviews": (
{
"reviewer": {
"name": "Kanwal Mehreen",
"location": "Islamabad, Pakistan"
},
"rating": 4.5,
"comments": "Magical and completely breathtaking!"
},
{
"reviewer": {
"name": "Isabella Martinez",
"location": "Bogotá, Colombia"
},
"rating": 4.7,
"comments": "A marvelous journey through a world of magic."
}
)
},
{
"title": "Things Fall Apart",
"author": "Chinua Achebe",
"reviews": (
{
"reviewer": {
"name": "Zara Khan",
"location": "Lagos, Nigeria"
},
"rating": 4.9,
"comments": "Things Fall Apart is the best of contemporary African literature."
})})}
The JSON data mentioned above represents a list of books, where each book has a title, author, and a list of reviews. Each review, in turn, has a reviewer (with name and location) and a rating and comments.
Method 1: use the json.load()
and pd.DataFrame()
functions
The simplest and most direct method is to use the integrated software json.load()
function to parse our JSON data. This will convert it to a Python dictionary and we can then create the DataFrame directly from the resulting Python data structure. However, it has a problem: it can only handle single nested data. So for the above case if you just use these steps with this code:
import json
import pandas as pd
#Load the JSON data
with open('books.json','r') as f:
data = json.load(f)
#Create a DataFrame from the JSON data
df = pd.DataFrame(data('books'))
df
Your result could look like this:
Production:
In the reviews column you can see the complete dictionary. Therefore, if you want the output to appear correctly, you must manually handle the nested structure. This can be done as follows:
#Create a DataFrame from the nested JSON data
df = pd.DataFrame((
{
'title': book('title'),
'author': book('author'),
'reviewer_name': review('reviewer')('name'),
'reviewer_location': review('reviewer')('location'),
'rating': review('rating'),
'comments': review('comments')
}
for book in data('books')
for review in book('reviews')
))
Updated output:
Here, we use list comprehension to create a flat list of dictionaries, where each dictionary contains the information of the book and the corresponding review. Then we create Pandas DataFrae using this.
However, the problem with this approach is that it requires more manual effort to manage the nested structure of the JSON data. And now that? Do we have any other option?
Completely! Well then. Since we are in the 21st century, facing such a problem without a solution seems unrealistic. Let's look at the other approach.
Method 2 (recommended): Use the json_normalize()
function
He json_normalize()
The Pandas library function is a better way to manage nested JSON data. Automatically flattens the nested structure of JSON data, creating a DataFrame from the resulting data. Let's take a look at the code:
import pandas as pd
import json
#Load the JSON data
with open('books.json', 'r') as f:
data = json.load(f)
#Create the DataFrame using json_normalize()
df = pd.json_normalize(
data=data('books'),
meta=('title', 'author'),
record_path="reviews",
errors="raise"
)
df
Production:
He json_normalize()
The function takes the following parameters:
- data: The input data, which can be a list of dictionaries or a single dictionary. In this case, it is the data dictionary loaded from the JSON file.
- record_path: The path in the JSON data to the records you want to normalize. In this case, it is the key 'reviews'.
- goal: Additional fields to include in the normalized output of the JSON document. In this case, we use the “title” and “author” fields. Note that metadata columns usually appear last. This is how this feature works. As for the analysis, it doesn't matter, but for some magical reason, you want these columns to appear sooner. Sorry, but you have to do it manually.
- mistakes: The error handling strategy, which can be “ignore”, “raise”, or “warn”. We have set it to “raise”, so if there are any errors during the normalization process, it will throw an exception.
Ending
Both methods have their own advantages and use cases, and the choice of method depends on the structure and complexity of the JSON data. If the JSON data has a highly nested structure, the json_normalize()
The function might be the most suitable option as it can handle nested data automatically. If the JSON data is relatively simple and flat, the pd.read_json()
The function might be the easiest and most direct approach.
When dealing with large JSON files, it is essential to think about memory usage and performance, as loading the entire file into memory may not work. Therefore, you may need to look at other options, such as streaming the data, lazy loading, or using a more memory-efficient format like Parquet.
Kanwal Mehreen Kanwal is a machine learning engineer and technical writer with a deep passion for data science and the intersection of ai with medicine. She is the co-author of the eBook “Maximize Productivity with ChatGPT.” As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is a passionate advocate for change and founded FEMCodes to empower women in STEM fields.