Author's image | DALLE-3 and Canva
Have you ever had to deal with messy data sets? They are one of the biggest obstacles in any data science project. These data sets may contain inconsistencies, missing values, or irregularities that make analysis difficult. Data cleansing is the essential first step that lays the foundation for accurate and reliable information, but it is a long and time-consuming process.
Do not fear! Let me introduce you to Pyjanitor, a fantastic Python library that can save the day. It is a convenient Python package that provides a simple solution to these data cleansing challenges. In this article, I am going to discuss the importance of Pyjanitor along with its features and practical usage.
By the end of this article, you will have a clear understanding of how Pyjanitor simplifies data cleansing and its application in everyday data-related tasks.
What is Pyjanitor?
Pyjanitor is a Python port of the R package janitor, built on top of Pandas, that simplifies data preprocessing and cleaning tasks. It extends Pandas with a variety of useful functions that streamline cleaning, transforming, and preparing data sets. Think of it as an upgrade to your data cleansing toolkit. Are you eager to learn about Pyjanitor? Me too. Let us begin.
Getting Started
First things first, you need to install Pyjanitor. Open your terminal or command prompt and run the following command:
pip install pyjanitor
The next step is to import Pyjanitor and Pandas into your Python script. This can be done by:
import janitor
import pandas as pd
You are now ready to use Pyjanitor for your data cleaning tasks. Moving forward, I will cover some of the most useful features of Pyjanitor, which are:
1. Cleaning column names
Raise your hand if you've ever been frustrated by inconsistent column names. Yeah, me too. With Pyjanitor's clean_names() function, you can quickly standardize your column names with a single call. This function replaces spaces with underscores, converts all characters to lowercase, and even replaces periods with underscores. Let's understand it with a basic example.
# Create a data frame with inconsistent column names
student_df = pd.DataFrame({
    'Student.ID': [1, 2, 3],
    'Student Name': ['Sara', 'Hanna', 'Mathew'],
    'Student Gender': ['Female', 'Female', 'Male'],
    'Course*': ['Algebra', 'Data Science', 'Geometry'],
    'Grade': ['A', 'B', 'C']
})
# Clean the column names
# remove_special=True also strips characters such as '*'
clean_df = student_df.clean_names(remove_special=True)
print(clean_df)
Output:
   student_id student_name student_gender        course grade
0           1         Sara         Female       Algebra     A
1           2        Hanna         Female  Data Science     B
2           3       Mathew           Male      Geometry     C
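To make the renaming rules concrete, here is a minimal sketch of them in plain Python. It is not Pyjanitor's implementation: normalize_name is a hypothetical helper written for this illustration, and it only approximates the behavior described above.

```python
import re


def normalize_name(name: str) -> str:
    """Approximate the column-name cleanup described above (hypothetical helper)."""
    name = name.strip().lower()              # trim outer whitespace, lowercase
    name = re.sub(r"[ .]+", "_", name)       # spaces and periods become underscores
    name = re.sub(r"[^a-z0-9_]", "", name)   # drop remaining special characters
    name = re.sub(r"_+", "_", name)          # collapse repeated underscores
    return name.strip("_")                   # no leading or trailing underscores


columns = ["Student.ID", "Student Name", "Student Gender", "Course*", "Grade"]
cleaned = [normalize_name(c) for c in columns]
print(cleaned)
```

Seeing the rules spelled out like this makes it easier to predict what a messy header will turn into before you run it through the library.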
2. Renaming columns
Renaming columns improves not only our understanding of the data, but also its readability and consistency. The rename_column() function makes this task effortless. A simple example showing its usefulness is as follows:
student_df = pd.DataFrame({
    'stu_id': [1, 2],
    'stu_name': ['Ryan', 'James'],
})
# Renaming the columns
student_df = student_df.rename_column('stu_id', 'Student_ID')
student_df = student_df.rename_column('stu_name', 'Student_Name')
print(student_df.columns)
Output:
Index(['Student_ID', 'Student_Name'], dtype='object')
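rename_column() handles one column per call. When several columns need new names at once, one option is Pandas' own rename() with a mapping, which chains in the same way. The sketch below uses only Pandas; column_map is a name chosen for this illustration.

```python
import pandas as pd

student_df = pd.DataFrame({
    "stu_id": [1, 2],
    "stu_name": ["Ryan", "James"],
})

# Map every old name to its new name in one place
column_map = {"stu_id": "Student_ID", "stu_name": "Student_Name"}

# rename() returns a new data frame; the original is left untouched
renamed_df = student_df.rename(columns=column_map)
print(list(renamed_df.columns))
```

Keeping the mapping in a single dictionary also gives you one place to review all the names you changed.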
3. Handling missing values
Missing values are a real pain when working with data sets. Fortunately, Pyjanitor's fill_empty() function is useful in addressing these problems. Let's explore how to handle missing values using Pyjanitor with a practical example. First, we will create a dummy data frame that contains some missing values.
# Create a data frame with missing values
employee_df = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'name': ['Ryan', 'James', 'Alicia'],
    'department': ['HR', None, 'Engineering'],
    'salary': [60000, 55000, None]
})
Now, let's see how Pyjanitor can help fill in these missing values:
# Replace missing 'department' with 'Unknown'
# Replace the missing 'salary' with the mean of salaries
mean_salary = employee_df['salary'].mean()
employee_df = (
    employee_df
    .fill_empty(column_names='department', value='Unknown')
    .fill_empty(column_names='salary', value=mean_salary)
)
print(employee_df)
Output:
   employee_id    name   department   salary
0            1    Ryan           HR  60000.0
1            2   James      Unknown  55000.0
2            3  Alicia  Engineering  57500.0
In this example, the missing department of employee 'James' is replaced with 'Unknown', and the missing salary of 'Alicia' is replaced with the average of the salaries of 'Ryan' and 'James'. You can use several strategies to handle missing values, such as forward fill, backward fill, or filling with a specific value.
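As a rough illustration of the forward and backward strategies, the sketch below uses Pandas' own ffill() and bfill() rather than a Pyjanitor function. The readings data frame is made-up sample data for this example only.

```python
import pandas as pd

# Hypothetical sample data with a gap in the middle
readings = pd.DataFrame({
    "day": [1, 2, 3, 4],
    "temp": [20.5, None, None, 23.0],
})

# Forward fill: carry the last known value down into the gap
forward = readings["temp"].ffill()

# Backward fill: pull the next known value up into the gap
backward = readings["temp"].bfill()

print(forward.tolist())
print(backward.tolist())
```

Which direction makes sense depends on the data: forward fill suits values that persist until they change, while a fixed value or a mean is often safer for unordered records.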
4. Filtering rows and selecting columns
Filtering rows and selecting columns are crucial tasks in data analysis. Pyjanitor's methods chain cleanly with Pandas' own tools for this, such as query(), so you can select columns and filter rows based on specific conditions in the same workflow. Suppose you have a data frame containing student records and you want to filter out the students (rows) whose marks are less than 60. Let's explore how to achieve this.
# Create a data frame with student data
students_df = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'name': ['John', 'Julia', 'Ali', 'Sara', 'Sam'],
    'subject': ['Maths', 'General Science', 'English', 'History', 'Maths'],
    'marks': [85, 58, 92, 45, 75],
    'grade': ['A', 'C', 'A+', 'D', 'B']
})
# Keep only the rows where marks are at least 60
filtered_students_df = students_df.query('marks >= 60')
print(filtered_students_df)
Output:
   student_id  name  subject  marks grade
0           1  John    Maths     85     A
2           3   Ali  English     92    A+
4           5   Sam    Maths     75     B
Now suppose you also want to output only specific columns, such as just the name and ID, instead of your entire data. You can do this with label-based indexing as follows (Pyjanitor's select_columns(), used in the next section, is an alternative):
# Select specific columns
selected_columns_df = filtered_students_df.loc[:, ['student_id', 'name']]
print(selected_columns_df)
Output:
   student_id  name
0           1  John
2           3   Ali
4           5   Sam
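The two steps above can also be combined. The following is a minimal Pandas-only sketch, assuming the same student data (trimmed to the columns it needs), that filters rows with a boolean mask and selects columns in a single .loc call, then checks that it matches the query() approach. The variable names are chosen for this illustration.

```python
import pandas as pd

students_df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "name": ["John", "Julia", "Ali", "Sara", "Sam"],
    "marks": [85, 58, 92, 45, 75],
})

# Boolean mask: True for every row with marks of at least 60
passed_mask = students_df["marks"] >= 60

# .loc takes the row mask and the column list together
passed_df = students_df.loc[passed_mask, ["student_id", "name"]]

# Same result using query() followed by column selection
via_query = students_df.query("marks >= 60")[["student_id", "name"]]

print(passed_df)
print(passed_df.equals(via_query))
```

A mask is handy when the condition is built up in Python code, while query() tends to read better for short conditions written inline.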
5. Chaining methods
With Pyjanitor's method chaining support, you can perform multiple operations in a single expression. This ability stands out as one of its best features. To illustrate, let's consider a data frame containing data about cars:
# Create a data frame with sample car data
cars_df = pd.DataFrame({
    'Car ID': [101, None, 103, 104, 105],
    'Car Model': ['Toyota', 'Honda', 'BMW', 'Mercedes', 'Tesla'],
    'Price ($)': [25000, 30000, None, 40000, 45000],
    'Year': [2018, 2019, 2017, 2020, None]
})
print("Cars Data Before Applying Method Chaining:")
print(cars_df)
Output:
Cars Data Before Applying Method Chaining:
   Car ID Car Model  Price ($)    Year
0   101.0    Toyota    25000.0  2018.0
1     NaN     Honda    30000.0  2019.0
2   103.0       BMW        NaN  2017.0
3   104.0  Mercedes    40000.0  2020.0
4   105.0     Tesla    45000.0     NaN
Now we see that the data frame contains missing values and inconsistent column names. We can resolve this by performing operations such as clean_names(), rename_column(), and dropna() sequentially, on several lines. Alternatively, we can chain these methods into a single expression for a smooth workflow and cleaner code.
# Chain methods to clean column names, drop rows with missing values, select specific columns, and rename columns
cleaned_cars_df = (
    cars_df
    # Clean column names; the extra options turn 'Price ($)' into 'price'
    .clean_names(remove_special=True, strip_underscores='both')
    .dropna()  # Drop rows with missing values
    .select_columns(['car_id', 'car_model', 'price'])  # Select columns
    .rename_column('price', 'price_usd')  # Rename column
)
print("Cars Data After Applying Method Chaining:")
print(cleaned_cars_df)
Output:
Cars Data After Applying Method Chaining:
   car_id car_model  price_usd
0   101.0    Toyota    25000.0
3   104.0  Mercedes    40000.0
The following operations have been performed in this pipeline:
clean_names() cleans the column names.
dropna() removes rows with missing values.
select_columns() selects specific columns, which are 'car_id', 'car_model', and 'price'.
rename_column() changes the name of the column 'price' to 'price_usd'.
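For comparison, here is a sketch of the same cleanup written step by step with plain Pandas, one intermediate variable per operation. The manual rename() mapping stands in for clean_names(), and the step1 to step3 names are chosen for this illustration.

```python
import pandas as pd

cars_df = pd.DataFrame({
    "Car ID": [101, None, 103, 104, 105],
    "Car Model": ["Toyota", "Honda", "BMW", "Mercedes", "Tesla"],
    "Price ($)": [25000, 30000, None, 40000, 45000],
    "Year": [2018, 2019, 2017, 2020, None],
})

# Step 1: standardize the column names by hand
step1 = cars_df.rename(columns={
    "Car ID": "car_id",
    "Car Model": "car_model",
    "Price ($)": "price",
    "Year": "year",
})

# Step 2: drop every row that has at least one missing value
step2 = step1.dropna()

# Step 3: keep only the columns of interest
step3 = step2[["car_id", "car_model", "price"]]

# Step 4: give the price column a more descriptive name
result = step3.rename(columns={"price": "price_usd"})
print(result)
```

The result is the same, but the chained version avoids the intermediate variables and reads top to bottom as a single pipeline.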
Wrapping Up
So to conclude, Pyjanitor proves to be a magical library for anyone working with data. It offers many more features than those discussed in this article, such as encoding categorical variables, getting features and labels, identifying duplicate rows, and much more. All these advanced features and methods can be explored in its documentation. The deeper you dig into its features, the more you will be amazed by its powerful functionality. Finally, enjoy manipulating your data with Pyjanitor.
Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI and medicine. She is the co-author of the eBook “Maximizing Productivity with ChatGPT.” As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change and founded FEMCodes to empower women in STEM fields.