Image generated with the Segmind SSD-1B model
Are you excited to start analyzing data using SQL? Well, you might have to wait a little. But why?
Data in database tables can often be confusing. Your data may contain missing values, duplicate records, outliers, inconsistent data entries, and more. Therefore, it is very important to clean the data before it can be analyzed using SQL.
When you are learning SQL, you can create database tables, modify them, and update and delete records as you please. But in practice this is rarely the case. You often will not have permission to modify tables or to update and delete records. You will, however, have read access to the database and be able to run a bunch of SELECT queries.
In this tutorial, we will create a database table, fill it with records, and see how we can clean the data with SQL. Let's start coding!
For this tutorial, let's create an employees table like this:
-- Create the employees table
CREATE TABLE employees (
employee_id INT PRIMARY KEY,
employee_name VARCHAR(50),
salary DECIMAL(10, 2),
hire_date VARCHAR(20),
department VARCHAR(50)
);
Next, let's insert some dummy sample records into the table:
-- Insert 20 sample records
INSERT INTO employees (employee_id, employee_name, salary, hire_date, department) VALUES
(1, 'Amy West', 60000.00, '2021-01-15', 'HR'),
(2, 'Ivy Lee', 75000.50, '2020-05-22', 'Sales'),
(3, 'joe smith', 80000.75, '2019-08-10', 'Marketing'),
(4, 'John White', 90000.00, '2020-11-05', 'Finance'),
(5, 'Jane Hill', 55000.25, '2022-02-28', 'IT'),
(6, 'Dave West', 72000.00, '2020-03-12', 'Marketing'),
(7, 'Fanny Lee', 85000.50, '2018-06-25', 'Sales'),
(8, 'Amy Smith', 95000.25, '2019-11-30', 'Finance'),
(9, 'Ivy Hill', 62000.75, '2021-07-18', 'IT'),
(10, 'Joe White', 78000.00, '2022-04-05', 'Marketing'),
(11, 'John Lee', 68000.50, '2018-12-10', 'HR'),
(12, 'Jane West', 89000.25, '2017-09-15', 'Sales'),
(13, 'Dave Smith', 60000.75, '2022-01-08', NULL),
(14, 'Fanny White', 72000.00, '2019-04-22', 'IT'),
(15, 'Amy Hill', 84000.50, '2020-08-17', 'Marketing'),
(16, 'Ivy West', 92000.25, '2021-02-03', 'Finance'),
(17, 'Joe Lee', 58000.75, '2018-05-28', 'IT'),
(18, 'John Smith', 77000.00, '2019-10-10', 'HR'),
(19, 'Jane Hill', 81000.50, '2022-03-15', 'Sales'),
(20, 'Dave White', 70000.25, '2017-12-20', 'Marketing');
As you may notice, I have sampled from a small set of first and last names to construct the name field for the records. Feel free to be more creative with these.
Note: All queries in this tutorial are written for MySQL. But you are free to use the RDBMS of your choice.
Missing values in data records are always a problem, so you have to handle them accordingly.
A naive approach is to discard all records that contain missing values for one or more fields. However, you should not do this unless you are sure that there is no better way to handle missing values.
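For instance, a naive filter on our table might simply exclude rows with a missing department. This is only a sketch of that approach, not a recommendation:
-- Naive approach: drop rows with a missing department from the result set
SELECT *
FROM employees
WHERE department IS NOT NULL;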
In the employees table, we see a NULL value in the 'department' column (see the row with employee_id 13), indicating that the field is missing:
You can use the COALESCE() function to substitute the string 'Unknown' for the NULL value:
SELECT
employee_id,
employee_name,
salary,
hire_date,
COALESCE(department, 'Unknown') AS department
FROM employees;
Running the above query you should get the following result:
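Since the queries here target MySQL, the IFNULL() function would work just as well as a two-argument alternative to COALESCE():
-- IFNULL() is MySQL's two-argument equivalent for this use case
SELECT
employee_id,
employee_name,
salary,
hire_date,
IFNULL(department, 'Unknown') AS department
FROM employees;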
Duplicate records in a database table can distort analysis results. Because we have chosen employee_id as the primary key of our table, we will not have repeated employee records in the employees table.
You can still check for duplicates using the SELECT DISTINCT statement:
SELECT DISTINCT * FROM employees;
As expected, the result set contains all 20 records:
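If you want to check for potential duplicates on fields other than the primary key, say repeated names, a simple GROUP BY ... HAVING sketch like the following can help:
-- Flag names that appear more than once (potential duplicates)
SELECT employee_name, COUNT(*) AS num_records
FROM employees
GROUP BY employee_name
HAVING COUNT(*) > 1;
In our sample data, this would flag 'Jane Hill', which appears under two different employee_id values. Whether that is a true duplicate or two different people with the same name is a judgment call that requires more context.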
If you notice, the 'hire_date' column is currently VARCHAR and not a date type. To make it easier to work with dates, it is helpful to use the STR_TO_DATE() function like this:
SELECT
employee_id,
employee_name,
salary,
STR_TO_DATE(hire_date, '%Y-%m-%d') AS hire_date,
department
FROM employees;
Here, we have only selected the 'hire_date' column among others and have not performed any operation on the date values. Therefore, the result of the query should be the same as the previous query.
But when you want to perform date arithmetic, such as adding an interval to the date values, this function comes in handy.
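For example, here is a small sketch that adds a 90-day interval to the parsed hire date; 'probation_end_date' is just a made-up name for illustration:
-- Add 90 days to the parsed hire date (probation_end_date is a hypothetical column)
SELECT
employee_id,
employee_name,
DATE_ADD(STR_TO_DATE(hire_date, '%Y-%m-%d'), INTERVAL 90 DAY) AS probation_end_date
FROM employees;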
Outliers in one or more numerical fields can bias the analysis. Therefore, we must look for and remove outliers to filter out data that is not relevant.
But deciding which values constitute outliers requires both domain knowledge and historical data.
In our example, let's say we know that the 'salary' column has an upper limit of 100000. So any valid entry in the 'salary' column can be at most 100000, and entries greater than this value are outliers.
We can verify these records by executing the following query:
SELECT *
FROM employees
WHERE salary > 100000;
As you can see, all entries in the 'salary' column are valid. So the result set is empty:
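If outliers were present and you preferred not to discard those records, one option (just a sketch, assuming capping is acceptable for your analysis) is to cap salaries at the upper limit with LEAST():
-- Cap any salary above the 100000 upper limit instead of dropping the record
SELECT
employee_id,
employee_name,
LEAST(salary, 100000) AS salary_capped,
hire_date,
department
FROM employees;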
Inconsistent data entries and formatting are quite common, especially in date and string columns.
In the employees table, we see that the name in the record for employee 'joe smith' (employee_id 3) is not in title case.
For consistency, let's select all names formatted in title case. You can use the CONCAT() function in conjunction with UPPER(), LOWER(), SUBSTRING(), and LOCATE() like so:
SELECT
employee_id,
CONCAT(
UPPER(SUBSTRING(employee_name, 1, 1)), -- Capitalize the first letter of the first name
LOWER(SUBSTRING(employee_name, 2, LOCATE(' ', employee_name) - 2)), -- Make the rest of the first name lowercase
' ',
UPPER(SUBSTRING(employee_name, LOCATE(' ', employee_name) + 1, 1)), -- Capitalize the first letter of the last name
LOWER(SUBSTRING(employee_name, LOCATE(' ', employee_name) + 2)) -- Make the rest of the last name lowercase
) AS employee_name_title_case,
salary,
hire_date,
department
FROM employees;
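If you only want to spot the records that need fixing, a quick (and admittedly simplistic) check is to look for names whose first letter is lowercase, using a case-sensitive BINARY comparison:
-- Find names whose first character is a lowercase letter (case-sensitive comparison)
SELECT employee_id, employee_name
FROM employees
WHERE BINARY SUBSTRING(employee_name, 1, 1) = LOWER(SUBSTRING(employee_name, 1, 1));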
When we talked about outliers, we mentioned that the upper limit for the 'salary' column is 100000, and that we consider any salary entry greater than 100000 an outlier.
It's also true that you don't want any negative values in the 'salary' column. You can run the following query to validate that all employee records contain salary values between 0 and 100000:
SELECT
employee_id,
employee_name,
salary,
hire_date,
department
FROM employees
WHERE salary < 0 OR salary > 100000;
As you can see, the result set is empty:
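If you did have permission to alter the table (and are on a MySQL version that enforces CHECK constraints, 8.0.16 or later), you could also enforce this rule at the table level. A sketch:
-- Prevent out-of-range salaries from being inserted in the first place
ALTER TABLE employees
ADD CONSTRAINT chk_salary_range CHECK (salary BETWEEN 0 AND 100000);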
Deriving new columns is not strictly a data cleaning step. However, in practice, you may need to use existing columns to derive new columns that are more useful in the analysis.
For example, the employees table contains a 'hire_date' column. A more useful field is perhaps a 'years_of_service' column that indicates how long an employee has been with the company.
The following query finds the difference between the current year and the year value in 'hire_date' to calculate 'years_of_service':
SELECT
employee_id,
employee_name,
salary,
hire_date,
department,
YEAR(CURDATE()) - YEAR(hire_date) AS years_of_service
FROM employees;
You should see the following result:
As with other queries we have run, this does not modify the original table. To add new columns to the original table, you must have permissions to ALTER the database table.
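For completeness, here is a sketch of what that could look like if you did have ALTER and UPDATE permissions:
-- Persist the derived column (requires ALTER and UPDATE privileges)
ALTER TABLE employees ADD COLUMN years_of_service INT;

UPDATE employees
SET years_of_service = YEAR(CURDATE()) - YEAR(STR_TO_DATE(hire_date, '%Y-%m-%d'));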
I hope you now understand how these data cleaning tasks can improve data quality and facilitate more reliable analyses. You've learned how to check for missing values, duplicate records, inconsistent formats, outliers, and more.
Try creating your own relational database table and running some queries to perform common data cleansing tasks. Next, learn about SQL for data visualization.
Bala Priya C. is a developer and technical writer from India. She enjoys working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She likes to read, write, code, and drink coffee! Currently, she is learning and sharing her knowledge with the developer community by creating tutorials, how-to guides, opinion pieces, and more.