Image by author
It is a widely held fact among data scientists that data cleaning constitutes a large proportion of our work time. However, it's also one of the least exciting parts. So this leads to a very natural question:
Is there a way to automate this process?
Automating any process is always easier said than done, since the steps to follow depend mainly on the specific project and objective. But there are always ways to automate at least some of the pieces.
This article aims to build a pipeline with a series of steps to make sure our data is clean and ready to be used.
Data cleaning process
Before proceeding to generate the pipeline, we must understand which parts of the process can be automated.
Since we want to create a process that can be used for almost any data science project, we first need to determine which steps are performed over and over again.
So when we work with a new data set, we usually ask ourselves the following questions:
- What format does the data come in?
- Does the data contain duplicates?
- Does the data contain missing values?
- What types of data does the data contain?
- Does the data contain outliers?
These 5 questions can easily be converted into 5 blocks of code to address each of the questions:
1. Data format
The data can come in different formats, such as JSON, CSV, or even XML. Each format requires its own data analyzer. For example, pandas provides read_csv for CSV files and read_json for JSON files.
By identifying the format, you will be able to choose the right tool to begin the cleaning process.
We can easily identify the format of the file we are dealing with using the os.path.splitext function of the os library. Therefore, we can create a function that first determines which extension we have and then applies the corresponding parser directly.
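As an illustration, here is a minimal sketch of such a dispatcher, assuming pandas as the parsing library; the function name load_data and the set of supported extensions are my own choices rather than anything prescribed by the pipeline (pd.read_xml requires a reasonably recent pandas version):

```python
import os
import pandas as pd

def load_data(file_path):
    """Pick the right pandas parser based on the file extension (illustrative helper)."""
    # os.path.splitext splits the path into (root, extension), e.g. ".csv"
    _, ext = os.path.splitext(file_path)
    ext = ext.lower()

    if ext == ".csv":
        return pd.read_csv(file_path)
    if ext == ".json":
        return pd.read_json(file_path)
    if ext == ".xml":
        # read_xml is available in recent pandas versions
        return pd.read_xml(file_path)
    raise ValueError(f"Unsupported file format: {ext}")
```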
2. Duplicates
It happens quite often that some rows of data contain exactly the same values as other rows, which we know as duplicates. Duplicate data can distort results and lead to inaccurate analyses, which is not a good thing.
That is why we must always ensure that there are no duplicates.
Pandas has us covered with the drop_duplicates() method, which deletes all duplicate rows from a data frame.
We can create a simple function that uses this method to remove all duplicates. If necessary, we add a columns input variable that adapts the function to remove duplicates based on a specific list of column names.
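A minimal sketch of that helper could look like the following, where remove_duplicates and its columns parameter are illustrative names:

```python
import pandas as pd

def remove_duplicates(df: pd.DataFrame, columns=None) -> pd.DataFrame:
    """Drop duplicate rows, optionally based only on a list of column names."""
    # With subset=None pandas compares entire rows;
    # otherwise only the given columns are used to spot duplicates.
    return df.drop_duplicates(subset=columns).reset_index(drop=True)
```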
3. Missing values
Missing data is also a common problem when working with data. Depending on the nature of your data, we can simply remove observations that contain missing values, or we can fill these gaps using methods such as forward fill, backward fill, or substituting with the column mean or median.
Pandas offers us the .fillna() and .dropna() methods to handle these missing values effectively.
The choice of how we handle missing values depends on:
- The type of values that are missing.
- The proportion of missing values relative to the total number of records we have.
Dealing with missing values is a fairly complex task, and usually one of the most important; you can learn more about it in the following article.
For our pipeline, we will first check the total number of rows that present null values. If only 5% of them or less are affected, we will delete these records. In case there are more rows with missing values, we will check column by column and proceed with:
- Imputing the median value.
- Generating a warning to investigate further.
In this case, we are evaluating missing values with a hybrid human validation process. As you already know, evaluating missing values is a crucial task that cannot be overlooked.
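One possible sketch of this hybrid rule, assuming the 5% threshold described above and treating only numeric columns as candidates for median imputation (the function name handle_missing_values is hypothetical):

```python
import warnings
import pandas as pd

def handle_missing_values(df: pd.DataFrame, threshold: float = 0.05) -> pd.DataFrame:
    """Apply the hybrid rule: drop rows if few are affected, otherwise impute or warn."""
    # Share of rows containing at least one missing value
    affected = df.isnull().any(axis=1).mean()

    # If 5% of the rows or fewer are affected, simply remove them
    if affected <= threshold:
        return df.dropna().reset_index(drop=True)

    # Otherwise, go column by column
    for col in df.columns[df.isnull().any()]:
        if pd.api.types.is_numeric_dtype(df[col]):
            # Impute the median value for numeric columns
            df[col] = df[col].fillna(df[col].median())
        else:
            # Flag non-numeric columns for manual investigation
            warnings.warn(f"Column '{col}' has missing values; investigate further.")
    return df
```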
4. Data types
When working with standard data types, we can transform the columns directly with the pandas .astype() function, so you could modify the code to perform the usual conversions.
Otherwise, it is often too risky to assume that a transformation will go smoothly when working with new data.
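A cautious sketch of that conversion step might wrap .astype() in a try/except so that a failed cast produces a warning instead of breaking the pipeline; convert_column_types and type_map are illustrative names, not part of any library API:

```python
import warnings
import pandas as pd

def convert_column_types(df: pd.DataFrame, type_map: dict) -> pd.DataFrame:
    """Try to cast each column to its expected dtype, warning instead of failing."""
    for col, dtype in type_map.items():
        try:
            df[col] = df[col].astype(dtype)
        except (ValueError, TypeError):
            # New data may contain values that break the cast,
            # so we warn rather than stop the whole pipeline.
            warnings.warn(f"Could not convert column '{col}' to {dtype}.")
    return df
```

You would call it with a mapping of expected types, for example convert_column_types(df, {"price": "float64", "city": "category"}).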
5. Deal with outliers
Outliers can significantly affect the results of your data analysis. Techniques for handling outliers include setting thresholds, limiting values, or using statistical methods such as Z-score.
To determine if we have outliers in our data set, we apply a common rule and consider any record outside the range (Q1 – 1.5 * IQR, Q3 + 1.5 * IQR) as an outlier.
Where IQR represents the interquartile range and Q1 and Q3 are the first and third quartiles. Below you can see all the previous concepts displayed in a box plot.
Image by author
To detect the presence of outliers, we can easily define a function that checks which columns have values that are outside the above range and generates a warning.
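A minimal sketch of such a check, applied to every numeric column (check_outliers is an illustrative name):

```python
import warnings
import pandas as pd

def check_outliers(df: pd.DataFrame) -> None:
    """Warn about numeric columns containing values outside the 1.5 * IQR range."""
    for col in df.select_dtypes(include="number").columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        n_outliers = ((df[col] < lower) | (df[col] > upper)).sum()
        if n_outliers > 0:
            warnings.warn(f"Column '{col}' has {n_outliers} potential outliers.")
```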
Final thoughts
Data cleaning is a crucial part of any data project; however, it is also usually the most boring and time-consuming phase. That's why this article condenses a comprehensive approach into a practical 5-step process to automate data cleaning using Python and pandas.
The process is not just about implementing code. It integrates reflective decision-making criteria that guide the user in managing different data scenarios.
This combination of automation with human supervision ensures efficiency and accuracy, making it a solid solution for data scientists looking to optimize their workflow.
You can check my full code in the following GitHub repository.
Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and currently works in the field of data science applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes about all things AI, covering the application of the ongoing explosion in this field.