Pandas is the most used Python library for data analysis and manipulation. But the data you read from the source often requires a series of data cleaning steps before you can analyze it to gain valuable insights, answer business questions, or build machine learning models.
This guide breaks down the process of cleaning data with pandas into 7 practical steps. We will create a sample data set and follow the data cleaning steps.
Let us begin!
How to create a sample data frame
Before we start the actual data cleaning steps, let's create a pandas data frame with employee records. We will use Faker for synthetic data generation. So install it first:
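pip install Faker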
You can follow along with the same example, or use a data set of your choice. Here is the code to generate 1000 records:
import pandas as pd
from faker import Faker
import random

# Initialize Faker to generate synthetic data
fake = Faker()

# Set seed for reproducibility
Faker.seed(42)

# Generate synthetic data
data = []
for _ in range(1000):
    data.append({
        'Name': fake.name(),
        'Age': random.randint(18, 70),
        'Email': fake.email(),
        'Phone': fake.phone_number(),
        'Address': fake.address(),
        'Salary': random.randint(20000, 150000),
        'Join_Date': fake.date_this_decade(),
        'Employment_Status': random.choice(['Full-Time', 'Part-Time', 'Contract']),
        'Department': random.choice(['IT', 'Engineering', 'Finance', 'HR', 'Marketing'])
    })
Let's modify this data frame a little to introduce missing values, duplicate records, outliers, and more:
# Let's tweak the records a bit!
# Introduce missing values
for i in random.sample(range(len(data)), 50):
    data[i]['Email'] = None

# Introduce duplicate records
data.extend(random.sample(data, 100))

# Introduce outliers
for i in random.sample(range(len(data)), 20):
    data[i]['Salary'] = random.randint(200000, 500000)
Now let's create a data frame with these records:
# Create dataframe
df = pd.DataFrame(data)
Note that we set the seed for Faker but not for the random module, so there will be some randomness in the records you generate.
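If you want fully reproducible records, you can also seed Python's random module before generating the data:

random.seed(42)  # makes random.randint() and random.choice() reproducible too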
Step 1: Understand the data
Step 0 is always to understand the business question or problem you are trying to solve. Once you know that, you can start working with the data you've read into your pandas data frame.
But before you can do anything meaningful with the data set, it's important to first get a high-level overview of the data set. This includes getting basic information about the different fields and the total number of records, inspecting the data frame header, and similar things.
Here, we call the info() method on the data frame:
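df.info()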
Output >>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1100 entries, 0 to 1099
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               1100 non-null   object
 1   Age                1100 non-null   int64
 2   Email              1047 non-null   object
 3   Phone              1100 non-null   object
 4   Address            1100 non-null   object
 5   Salary             1100 non-null   int64
 6   Join_Date          1100 non-null   object
 7   Employment_Status  1100 non-null   object
 8   Department         1100 non-null   object
dtypes: int64(2), object(7)
memory usage: 77.5+ KB
And inspect the head of the data frame:
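df.head()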
Output of df.head()
Step 2: Handling duplicates
Duplicate records are a common problem that distorts analysis results. Therefore, we need to identify and remove all duplicate records to work only with the unique data records.
This is how we find all the duplicates in the data frame and then drop them in place:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print("Number of duplicate rows:", duplicates)
# Removing duplicate rows
df.drop_duplicates(inplace=True)
Output >>>
Number of duplicate rows: 100
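The check above treats rows as duplicates only when every column matches. If certain fields together identify a record, you can restrict the check to those columns. Here's a minimal sketch, assuming 'Name' and 'Email' together identify an employee:

# Count rows that repeat an existing Name + Email combination
# (assumes these two fields together identify an employee)
key_dupes = df.duplicated(subset=['Name', 'Email']).sum()
print("Duplicates by Name + Email:", key_dupes)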
Step 3: Handling missing data
Missing data is a common data quality issue in many data science projects. If you take a quick look at the output of the info() method from the previous step, you'll see that the non-null count is not the same for all fields: the Email column has missing values. Let's get the exact count.
To get the number of missing values in each column, you can run:
# Check for missing values
missing_values = df.isna().sum()
print("Missing Values:")
print(missing_values)
Output >>>
Missing Values:
Name 0
Age 0
Email 50
Phone 0
Address 0
Salary 0
Join_Date 0
Employment_Status 0
Department 0
dtype: int64
If values were missing in one or more numerical columns, we could apply appropriate imputation techniques. But because the missing values here are in the 'Email' field, let's fill them with a placeholder email like this:
# Handle missing values by filling in a placeholder email
df['Email'] = df['Email'].fillna('[email protected]')
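For reference, here's a minimal sketch of what numerical imputation could look like, using the median since it is robust to outliers. In our data frame, 'Salary' has no missing values, so this is a no-op:

# Hypothetical: fill missing numerical values with the column median
df['Salary'] = df['Salary'].fillna(df['Salary'].median())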
Step 4: Transform data
When you work on the data set, there may be one or more fields that do not have the expected data type. In our sample data frame, the 'Join_Date' field needs to be converted to a valid datetime object:
# Convert 'Join_Date' to datetime
df['Join_Date'] = pd.to_datetime(df['Join_Date'])

print("Join_Date after conversion:")
print(df['Join_Date'].head())
Output >>>
Join_Date after conversion:
0 2023-07-12
1 2020-12-31
2 2024-05-09
3 2021-01-19
4 2023-10-04
Name: Join_Date, dtype: datetime64[ns]
Because we have the joining date, it's actually more useful to have a 'Years_Employed' column, as shown:
# Create a new feature 'Years_Employed' based on 'Join_Date'
df['Years_Employed'] = pd.Timestamp.now().year - df['Join_Date'].dt.year

print("New feature 'Years_Employed':")
print(df[['Join_Date', 'Years_Employed']].head())
Output >>>
New feature 'Years_Employed':
Join_Date Years_Employed
0 2023-07-12 1
1 2020-12-31 4
2 2024-05-09 0
3 2021-01-19 3
4 2023-10-04 1
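Real-world date columns are rarely this clean. If you expect malformed date strings, one common safeguard is to have pandas coerce unparseable values to NaT instead of raising an error; a sketch:

# Hypothetical: unparseable date strings become NaT rather than raising an error
df['Join_Date'] = pd.to_datetime(df['Join_Date'], errors='coerce')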
Step 5: Clean text data
It is quite common to encounter inconsistently formatted string fields or similar problems. Cleaning text can be as simple as applying a case conversion or as difficult as writing a complex regular expression to get the string into the required format.
In the example data frame we have, we see that the 'Address' column contains many '\n' characters which make it difficult to read. So replace them with spaces like this:
# Clean address strings
df['Address'] = df['Address'].str.replace('\n', ' ', regex=False)

print("Address after text cleaning:")
print(df['Address'].head())
Output >>>
Address after text cleaning:
0 79402 Peterson Drives Apt. 511 Davisstad, PA 35172
1 55341 Amanda Gardens Apt. 764 Lake Mark, WI 07832
2 710 Eric Estate Carlsonfurt, MS 78605
3 809 Burns Creek Natashaport, IA 08093
4 8713 Caleb Brooks Apt. 930 Lake Crystalbury, CA...
Name: Address, dtype: object
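As an example of regex-based cleaning, here's a sketch that extracts just the digits from the 'Phone' column into a hypothetical 'Phone_Digits' column. Faker's phone formats vary, so treat this as illustrative rather than a complete normalization:

# Hypothetical: keep only the digits from each phone number
df['Phone_Digits'] = df['Phone'].str.replace(r'\D', '', regex=True)
print(df[['Phone', 'Phone_Digits']].head())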
Step 6: Handling outliers
If you scroll back up, you'll see that we set some of the values in the 'Salary' column to be extremely high. These outliers must also be identified and handled appropriately so that they don't skew the analysis.
You'll often want to consider what makes a data point an outlier: is it an incorrect data entry, or is it actually a valid value rather than an outlier? You can then choose how to handle the outliers: remove the records that contain them, or take the subset of rows with outliers and analyze them separately.
Let's use the z-score and find those salary values that are more than three standard deviations away from the mean:
# Detect outliers using z-scores
z_scores = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
outliers = df[abs(z_scores) > 3]

print("Outliers based on Salary:")
print(outliers[['Name', 'Salary']].head())
Output >>>
Outliers based on Salary:
Name Salary
16 Michael Powell 414854
131 Holly Jimenez 258727
240 Daniel Williams 371500
328 Walter Bishop 332554
352 Ashley Munoz 278539
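If you decide these records should be excluded, you can filter on the same z-score mask:

# Keep only rows within three standard deviations of the mean salary
df_no_outliers = df[abs(z_scores) <= 3]
print("Rows after removing outliers:", len(df_no_outliers))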
Step 7: Merge data
In most projects, the data you have may not be all the data you want to use for analysis. You need to identify the most relevant fields, and often also merge in data from other data frames to get more useful information for your analysis.
As a quick exercise, create another related data frame and merge it with the existing data frame on a common column so the join makes sense. Merging in pandas works very similarly to joins in SQL, so I suggest you try it as an exercise; a possible solution sketch follows.
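Here's one way it could look, assuming a hypothetical dept_info lookup table keyed on the existing 'Department' column:

# Hypothetical lookup table keyed on the shared 'Department' column
dept_info = pd.DataFrame({
    'Department': ['IT', 'Engineering', 'Finance', 'HR', 'Marketing'],
    'Location': ['New York', 'Austin', 'Chicago', 'Boston', 'Seattle']
})

# Left join, analogous to a SQL LEFT JOIN on the shared column
df_merged = df.merge(dept_info, on='Department', how='left')
print(df_merged[['Name', 'Department', 'Location']].head())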
Wrapping up
That's all for this tutorial! We created a sample data frame with employee records and worked through the various data cleaning steps. Here's an overview of what we covered: understanding the data, handling duplicates, handling missing values, transforming data, cleaning text data, handling outliers, and merging data.
If you want to learn everything about data management with pandas, check out 7 Steps to Master Data Management with Pandas and Python.
Bala Priya C. (twitter.com/balawc27) is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.