A Beginner's Guide to Data Cleaning with Pyjanitor

Image by Author | DALL-E 3 and Canva

Have you ever had to deal with messy data sets? They are one of the biggest obstacles in any data science project. They may contain inconsistencies, missing values, or irregularities that make analysis difficult. Data cleaning is the essential first step that lays the foundation for accurate and reliable insights, but it is often tedious and time-consuming.

Do not fear! Let me introduce you to Pyjanitor, a Python library that can save the day. It is a convenient package that provides simple solutions to these data cleaning challenges. In this article, I will discuss why Pyjanitor matters, along with its features and practical usage.

By the end of this article, you will have a clear understanding of how Pyjanitor simplifies data cleaning and how it applies to everyday data-related tasks.

What is Pyjanitor?

Pyjanitor is a Python implementation of the R package janitor, built on top of Pandas, that simplifies data preprocessing and cleaning tasks. It extends Pandas with a variety of useful functions for cleaning, transforming, and preparing data sets. Think of it as an upgrade to your data cleaning toolkit. Are you eager to learn about Pyjanitor? Me too. Let us begin.

Getting Started

First things first, you need to install Pyjanitor. Open your terminal or command prompt and run the following command:

pip install pyjanitor

The next step is to import Pyjanitor and Pandas into your Python script:

import janitor
import pandas as pd

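You are now ready to use Pyjanitor for your data cleaning tasks. Importing janitor registers its functions as methods on Pandas data frames, so you can call them like any built-in DataFrame method. Next, I will cover some of the most useful features of Pyjanitor: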

1. Cleaning column names

Raise your hand if you've ever been frustrated by inconsistent column names. Yeah, me too. With Pyjanitor's clean_names() function, you can standardize your column names with a single call. By default it replaces spaces and periods with underscores and converts all characters to lowercase. Optional arguments such as remove_special=True and strip_underscores=True go further, dropping special characters (like an asterisk) and trimming leading or trailing underscores. Let's understand it with a basic example.

# Create a data frame with inconsistent column names
student_df = pd.DataFrame({
    'Student.ID': [1, 2, 3],
    'Student Name': ['Sara', 'Hanna', 'Mathew'],
    'Student Gender': ['Female', 'Female', 'Male'],
    'Course*': ['Algebra', 'Data Science', 'Geometry'],
    'Grade': ['A', 'B', 'C']
})

# Clean the column names; remove_special=True also drops the '*' in 'Course*'
clean_df = student_df.clean_names(remove_special=True)
print(clean_df)

Output:

   student_id    student_name    student_gender        course    grade
0           1            Sara            Female       Algebra        A
1           2           Hanna            Female  Data Science        B
2           3          Mathew              Male      Geometry        C
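To make this kind of normalization concrete, here is a rough sketch that uses only the standard library. The helper name normalize_name is hypothetical, and this is not Pyjanitor's actual implementation; it only approximates what clean_names() does when the optional cleanups for special characters and stray underscores are switched on.

```python
import re

def normalize_name(name: str) -> str:
    """Hypothetical helper approximating column-name cleanup."""
    name = name.lower()                      # lowercase everything
    name = re.sub(r"[\s.]+", "_", name)      # spaces and periods become underscores
    name = re.sub(r"[^0-9a-z_]", "", name)   # drop other special characters such as '*'
    name = re.sub(r"_+", "_", name)          # collapse repeated underscores
    return name.strip("_")                   # trim leading/trailing underscores

columns = ["Student.ID", "Student Name", "Student Gender", "Course*", "Grade"]
cleaned = [normalize_name(c) for c in columns]
print(cleaned)
# ['student_id', 'student_name', 'student_gender', 'course', 'grade']
```

Writing the rules out by hand like this shows why a single library call is convenient: every rule is one more regular expression to get right.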

2. Renaming columns

Renaming columns can improve both our understanding of the data and its readability and consistency. The rename_column() function makes this task effortless. A simple example showing its usefulness is as follows:

student_df = pd.DataFrame({
    'stu_id': [1, 2],
    'stu_name': ['Ryan', 'James'],
})

# Rename the columns one at a time
student_df = student_df.rename_column('stu_id', 'Student_ID')
student_df = student_df.rename_column('stu_name', 'Student_Name')
print(student_df.columns)

Output:

Index(['Student_ID', 'Student_Name'], dtype='object')
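rename_column() changes one column per call. When several columns need new names, Pandas' own DataFrame.rename accepts a mapping and handles them in one step. A brief sketch using the same data (the variable name renamed_df is only for illustration):

```python
import pandas as pd

student_df = pd.DataFrame({
    "stu_id": [1, 2],
    "stu_name": ["Ryan", "James"],
})

# Map each old name to its new name; columns not listed are left unchanged
renamed_df = student_df.rename(columns={
    "stu_id": "Student_ID",
    "stu_name": "Student_Name",
})
print(list(renamed_df.columns))
# ['Student_ID', 'Student_Name']
```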

3. Handling missing values

Missing values are a real pain when working with data sets. Fortunately, Pyjanitor's fill_empty() function helps address them. Let's explore how to handle missing values using Pyjanitor with a practical example. First, we will create a dummy data frame that contains some missing values.

# Create a data frame with missing values
employee_df = pd.DataFrame({
    'employee_id': [1, 2, 3],
    'name': ['Ryan', 'James', 'Alicia'],
    'department': ['HR', None, 'Engineering'],
    'salary': [60000, 55000, None]
})

Now, let's see how Pyjanitor can help fill in these missing values:

# Replace missing 'department' with 'Unknown'
# Replace the missing 'salary' with the mean of salaries
mean_salary = employee_df['salary'].mean()
employee_df = (
    employee_df
    .fill_empty(column_names='department', value='Unknown')
    .fill_empty(column_names='salary', value=mean_salary)
)
print(employee_df)

Output:

   employee_id     name   department   salary
0            1     Ryan           HR  60000.0
1            2    James      Unknown  55000.0
2            3   Alicia  Engineering  57500.0

In this example, the missing department of 'James' is replaced by 'Unknown', and the missing salary of 'Alicia' is replaced by the average of the salaries of 'Ryan' and 'James'. You can use several strategies to handle missing values, such as forward fill, backward fill, or filling with a specific value.
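The three strategies just mentioned can be sketched with plain Pandas, which Pyjanitor builds on. The readings data below is made up purely for illustration.

```python
import pandas as pd

# Hypothetical readings with gaps (illustrative data only)
readings = pd.DataFrame({"reading": [10.0, None, None, 40.0]})

forward = readings["reading"].ffill()      # carry the last known value forward
backward = readings["reading"].bfill()     # pull the next known value backward
constant = readings["reading"].fillna(0)   # fill with a fixed value

print(forward.tolist())   # [10.0, 10.0, 10.0, 40.0]
print(backward.tolist())  # [10.0, 40.0, 40.0, 40.0]
print(constant.tolist())  # [10.0, 0.0, 0.0, 40.0]
```

Which strategy fits depends on the data: forward and backward fill suit ordered data such as time series, while a fixed value or a statistic like the mean suits unordered records.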

4. Filtering rows and selecting columns

Filtering rows and selecting columns are crucial tasks in data analysis. Pyjanitor provides functions for both, and because it works directly on Pandas data frames, it also combines freely with Pandas' own tools such as query(), which is used here. Suppose you have a data frame containing student records and you want to filter out the students (rows) whose marks are less than 60. Let's see how to achieve this.

# Create a data frame with student data
students_df = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5],
    'name': ['John', 'Julia', 'Ali', 'Sara', 'Sam'],
    'subject': ['Maths', 'General Science', 'English', 'History', 'Maths'],
    'marks': [85, 58, 92, 45, 75],
    'grade': ['A', 'C', 'A+', 'D', 'B']
})

# Keep only the rows where marks are at least 60
filtered_students_df = students_df.query('marks >= 60')
print(filtered_students_df)

Output:

   student_id  name  subject  marks grade
0           1  John    Maths     85     A
2           3   Ali  English     92    A+
4           5   Sam    Maths     75     B

Now suppose you also want to output only specific columns, such as just the ID and name, instead of the entire data frame. This can be done with Pandas' loc indexer (Pyjanitor's select_columns(), used in the next section, is an alternative):

# Select specific columns
selected_columns_df = filtered_students_df.loc[:, ['student_id', 'name']]
print(selected_columns_df)

Output:

   student_id  name
0           1  John
2           3   Ali
4           5   Sam
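If the cutoff is not fixed, the same filter can read it from a Python variable. A small sketch, assuming a hypothetical pass_mark variable: Pandas' query() lets an expression refer to local variables with the @ prefix, and a boolean mask inside loc filters rows and selects columns in one step.

```python
import pandas as pd

students_df = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "name": ["John", "Julia", "Ali", "Sara", "Sam"],
    "marks": [85, 58, 92, 45, 75],
})

pass_mark = 60  # hypothetical threshold, chosen for illustration

# '@pass_mark' refers to the Python variable defined above
passed = students_df.query("marks >= @pass_mark")

# Equivalent boolean-mask form that also selects two columns
passed_ids = students_df.loc[students_df["marks"] >= pass_mark, ["student_id", "name"]]

print(passed_ids)
```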

5. Chaining methods

Because every Pyjanitor function returns a data frame, you can chain several operations into a single expression. This ability stands out as one of its best features. To illustrate, let's consider a data frame containing data about cars:

# Create a data frame with sample car data
cars_df = pd.DataFrame({
    'Car ID': [101, None, 103, 104, 105],
    'Car Model': ['Toyota', 'Honda', 'BMW', 'Mercedes', 'Tesla'],
    'Price ($)': [25000, 30000, None, 40000, 45000],
    'Year': [2018, 2019, 2017, 2020, None]
})
print("Cars Data Before Applying Method Chaining:")
print(cars_df)

Output:

Cars Data Before Applying Method Chaining:
   Car ID Car Model  Price ($)    Year
0   101.0    Toyota    25000.0  2018.0
1     NaN     Honda    30000.0  2019.0
2   103.0       BMW        NaN  2017.0
3   104.0  Mercedes    40000.0  2020.0
4   105.0     Tesla    45000.0     NaN

Now we see that the data frame contains missing values and inconsistent column names. We could resolve this by calling clean_names(), dropna(), rename_column(), and so on one after another on separate lines. Alternatively, we can chain these methods into a single expression for a smoother workflow and cleaner code.

# Chain methods to clean column names, drop rows with missing values, select specific columns, and rename columns
cleaned_cars_df = (
    cars_df
    # Clean column names; the extra arguments turn 'Price ($)' into 'price'
    .clean_names(remove_special=True, strip_underscores=True)
    .dropna()  # Drop rows with missing values
    .select_columns(['car_id', 'car_model', 'price'])  # Select columns
    .rename_column('price', 'price_usd')  # Rename column
)

print("Cars Data After Applying Method Chaining:")
print(cleaned_cars_df)

Output:

Cars Data After Applying Method Chaining:
   car_id car_model  price_usd
0   101.0    Toyota    25000.0
3   104.0  Mercedes    40000.0

The following operations have been performed in this pipeline:

clean_names() standardizes the column names, so 'Price ($)' becomes 'price'.
dropna() removes rows with missing values.
select_columns() selects the columns 'car_id', 'car_model', and 'price'.
rename_column() changes the name of the column 'price' to 'price_usd'.
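For comparison, here is roughly what the same pipeline looks like when written with plain Pandas. This is a sketch rather than a recommended replacement: the column mapping is spelled out by hand, which is exactly the step that clean_names() automates, and the variable name plain_pandas_df is only for illustration.

```python
import pandas as pd

cars_df = pd.DataFrame({
    "Car ID": [101, None, 103, 104, 105],
    "Car Model": ["Toyota", "Honda", "BMW", "Mercedes", "Tesla"],
    "Price ($)": [25000, 30000, None, 40000, 45000],
    "Year": [2018, 2019, 2017, 2020, None],
})

plain_pandas_df = (
    cars_df
    # Hand-written mapping instead of automatic name cleaning
    .rename(columns={"Car ID": "car_id", "Car Model": "car_model",
                     "Price ($)": "price", "Year": "year"})
    .dropna()                                  # drop rows with any missing value
    .loc[:, ["car_id", "car_model", "price"]]  # keep three columns
    .rename(columns={"price": "price_usd"})    # final rename
)
print(plain_pandas_df)
```

Only the rows for Toyota and Mercedes survive, because every other row has at least one missing value.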

Wrapping Up

To conclude, Pyjanitor proves to be a valuable library for anyone working with data. It offers many more features than those discussed in this article, such as encoding categorical variables, getting features and labels, identifying duplicate rows, and much more. All these advanced features and methods can be explored in its documentation. The deeper you dig into its features, the more you will appreciate how much it can do. Finally, enjoy manipulating your data with Pyjanitor.

Kanwal Mehreen is a machine learning engineer and technical writer with a deep passion for data science and the intersection of AI and medicine. She is the co-author of the eBook “Maximizing Productivity with ChatGPT.” As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She is also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change and founded FEMCodes to empower women in STEM fields.
