![SQL Grouping and Partitioning Scenarios: When and How to Combine Data in Data Science](https://technicalterrence.com/wp-content/uploads/2024/01/Grouping-and-Partitioning-Scenarios-by-SQL-When-and-How-to.png)
Image by freepik
SQL (Structured Query Language) is a programming language used to manage and manipulate data. SQL queries are essential for interacting with databases in a structured and efficient manner.
Grouping in SQL is a powerful tool for organizing and analyzing data. It helps extract meaningful information and summaries from complex data sets. Grouping is best used to summarize and understand the characteristics of the data, which helps businesses with analytical and reporting tasks.
We often need to combine records that share common values in order to calculate statistics on each group. Most of these cases can be generalized into common scenarios, which can then be applied whenever a similar requirement arises.
The GROUP BY clause in SQL is used to
- group data by one or more columns
- reduce each group to a single row
- perform aggregation operations on other columns within the groups.
Grouping column = the column whose value must be the same for all rows in a group.
Aggregation column = the column whose values usually differ within a group and to which a function such as sum(), max(), etc. is applied.
The aggregation column should not be the same as the grouping column.
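To make these three ideas concrete, here is a minimal runnable sketch using SQLite from Python; the sales table and its figures are invented for illustration:

```python
import sqlite3

# In-memory database; the sales table and its figures are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("create table sales (category text, amount integer)")
conn.executemany(
    "insert into sales values (?, ?)",
    [("toys", 6000), ("toys", 4700), ("books", 4200)],
)

# Grouping column: category. Aggregation column: amount.
# Each group collapses to a single row carrying the sum of its amounts.
rows = conn.execute(
    "select category, sum(amount) as sales from sales "
    "group by category order by sales desc"
).fetchall()
print(rows)  # [('toys', 10700), ('books', 4200)]
```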
Scenario 1: Grouping to find the sum of the Total
Let's say we want to calculate the total sales for each category in the sales table.
Therefore, we will group by category and aggregate the individual sales in each category.
select category,
sum(amount) as sales
from sales
group by category;
Grouping column = category
Aggregation column = amount
Aggregation function = sum()
category | sales |
---|---|
toys | 10,700 |
books | 4,200 |
gym equipment | 2,000 |
stationery | 1,400 |
Scenario 2: Grouping to find Count
Let's say we want to calculate the employee count in each department.
In this case, we will group by department and calculate the employee count in each department.
select department,
count(empid) as emp_count
from employees
group by department;
Grouping column = department
Aggregation column = empid
Aggregation function = count()
department | emp_count |
---|---|
finance | 7 |
marketing | 12 |
technology | 20 |
Scenario 3: Grouping to find the average
Let's say we want to calculate the average salary of employees in each department.
Similarly, we will group them again by department and calculate the average salaries of employees in each department separately.
select department,
avg(salary) as avg_salary
from employees
group by department;
Grouping column = department
Aggregation column = salary
Aggregation function = avg()
department | avg_salary |
---|---|
finance | 2,500 |
marketing | 4,700 |
technology | 10,200 |
Scenario 4: Grouping to find Maximum/Minimum
Let's say we want to calculate the highest salary of the employees in each department.
We will group the departments and calculate the maximum salary in each department.
select department,
max(salary) as max_salary
from employees
group by department;
Grouping column = department
Aggregation column = salary
Aggregation function = max()
department | max_salary |
---|---|
finance | 4,000 |
marketing | 9,000 |
technology | 12,000 |
Scenario 5: Grouping to find duplicates
Let's say we want to find the same or duplicate customer names in our database.
We will group by customer name and use count as the aggregation function. Additionally, we will use the HAVING clause on the aggregation function to keep only the groups whose count is greater than one.
select name,
count(*) AS duplicate_count
from customers
group by name
having count(*) > 1;
Grouping column = name
Aggregation column = *
Aggregation function = count
HAVING = filter condition applied to the result of the aggregation function
name | duplicate_count |
---|---|
Jake Junning | 2 |
Maria Luna | 3 |
Peter Parker | 5 |
Oliver Queen | 2 |
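The duplicate check can be verified end to end with SQLite; the customer names below are invented, with one name repeated on purpose:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table customers (name text)")
# Sample names are invented; "Maria Luna" appears three times on purpose.
conn.executemany(
    "insert into customers values (?)",
    [("Jake Junning",), ("Maria Luna",), ("Maria Luna",), ("Maria Luna",)],
)

# HAVING filters groups *after* aggregation, so only names
# appearing more than once survive.
dupes = conn.execute(
    "select name, count(*) as duplicate_count from customers "
    "group by name having count(*) > 1"
).fetchall()
print(dupes)  # [('Maria Luna', 3)]
```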
The PARTITION BY clause in SQL is used to
- group/partition data by some columns
- preserve individual rows instead of combining them into one
- perform sort and aggregation operations on other columns in each group/partition.
Partition column = the column by which we group the data. The partition column value must be the same for all rows in a partition. If not specified, the entire table is treated as a single partition.
Sort column = within each partition created by the partition column, rows are sorted by this column.
Sort function = a window (ranking or aggregation) function applied to the rows in each partition.
Scenario 6: Partition to find the highest record in a group
Let's say we want to calculate which book in each category has the highest sales, along with the amount the best-selling book has generated.
In this case, we cannot use a group by clause, because grouping will reduce the records in each category to a single row.
However, we need the record details such as book name, quantity, etc. along with the category to see which book has had the highest sales in each category.
select category, book_name, amount,
row_number() over (partition by category order by amount desc) as sales_rank
from book_sales;
Partition column = category
Order column = amount
Sort function = row_number()
This query returns all the rows from the book_sales table, ranked within each book category so that the best-selling book gets row number 1.
Now we filter for row number 1 to get the best-selling book in each category.
select category, book_name, amount from (
select category, book_name, amount,
row_number() over (partition by category order by amount desc) as sales_rank
from book_sales
) as book_ranked_sales
where sales_rank = 1;
The above filter will give us only the best-selling books in each category along with the amount of sales each best-selling book has made.
category | book_name | amount |
---|---|---|
science | The hidden messages in the water | 20,700 |
fiction | Harry Potter | 50,600 |
spirituality | Autobiography of a yogi | 30,800 |
self help | The 5 love languages | 12,700 |
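The same top-1-per-group pattern can be checked with a small SQLite sketch; the sample books and amounts are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table book_sales (category text, book_name text, amount integer)"
)
conn.executemany(
    "insert into book_sales values (?, ?, ?)",
    [("fiction", "Harry Potter", 50600), ("fiction", "Dune", 20100),
     ("science", "Cosmos", 15000), ("science", "Sapiens", 20700)],
)

# Rank rows inside each category (highest amount = rank 1),
# then keep only the rank-1 row per category.
best = conn.execute("""
    select category, book_name, amount from (
        select category, book_name, amount,
               row_number() over (partition by category order by amount desc)
                   as sales_rank
        from book_sales
    ) as ranked
    where sales_rank = 1
    order by category
""").fetchall()
print(best)  # [('fiction', 'Harry Potter', 50600), ('science', 'Sapiens', 20700)]
```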
Scenario 7: Partition to find running totals in a group
Let's say we want to calculate the running total of sales for each product. We need a separate running total per product.
We will partition by product_id and order the rows in each partition by date, most recent first.
select product_id, date, amount,
sum(amount) over (partition by product_id order by date desc) as running_total
from sales_data;
Partition column = product_id
Order column = date
Aggregation function = sum()
product_id | date | amount | running_total |
---|---|---|---|
1 | 2023-12-25 | 3,900 | 3,900 |
1 | 2023-12-24 | 3,000 | 6,900 |
1 | 2023-12-23 | 2,700 | 9,600 |
1 | 2023-12-22 | 1,800 | 11,400 |
2 | 2023-12-25 | 2,000 | 2,000 |
2 | 2023-12-24 | 1,000 | 3,000 |
2 | 2023-12-23 | 700 | 3,700 |
3 | 2023-12-25 | 1,500 | 1,500 |
3 | 2023-12-24 | 400 | 1,900 |
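A minimal SQLite sketch of the per-product running total, using a few invented rows and the same order by date desc as the query above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table sales_data (product_id integer, date text, amount integer)"
)
conn.executemany(
    "insert into sales_data values (?, ?, ?)",
    [(1, "2023-12-24", 3000), (1, "2023-12-25", 3900), (2, "2023-12-25", 2000)],
)

# sum() over a partition ordered by date desc accumulates from the
# newest sale backwards, one independent running total per product.
rows = conn.execute("""
    select product_id, date, amount,
           sum(amount) over (partition by product_id order by date desc)
               as running_total
    from sales_data
    order by product_id, date desc
""").fetchall()
print(rows)
# [(1, '2023-12-25', 3900, 3900), (1, '2023-12-24', 3000, 6900),
#  (2, '2023-12-25', 2000, 2000)]
```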
Scenario 8: Partition to compare values within a group
Let's say we want to compare each employee's salary to the average salary in their department.
We will partition the employees by department and compute the average salary within each department.
Each employee's salary can then be compared with that average to see whether it is above or below the department average.
select employee_id, salary, department,
avg(salary) over (partition by department) as avg_dept_sal
from employees;
Partition column = department
Order column = none
Aggregation function = avg()
employee_id | salary | department | avg_dept_sal |
---|---|---|---|
1 | 7,200 | finance | 6,400 |
2 | 8,000 | finance | 6,400 |
3 | 4,000 | finance | 6,400 |
4 | 12,000 | technology | 11,333 |
5 | 15,000 | technology | 11,333 |
6 | 7,000 | technology | 11,333 |
7 | 4,000 | marketing | 5,000 |
8 | 6,000 | marketing | 5,000 |
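The comparison can be sketched in SQLite with invented salaries; the diff column below is an addition for illustration, showing each salary minus its department average:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table employees (employee_id integer, salary integer, department text)"
)
conn.executemany(
    "insert into employees values (?, ?, ?)",
    [(1, 7200, "finance"), (2, 8000, "finance"), (3, 4000, "finance")],
)

# With no ORDER BY inside the window, avg() is computed over the
# whole partition and repeated on every row of it.
rows = conn.execute("""
    select employee_id, salary, department,
           avg(salary) over (partition by department) as avg_dept_sal,
           salary - avg(salary) over (partition by department) as diff
    from employees
    order by employee_id
""").fetchall()
print(rows)
# [(1, 7200, 'finance', 6400.0, 800.0),
#  (2, 8000, 'finance', 6400.0, 1600.0),
#  (3, 4000, 'finance', 6400.0, -2400.0)]
```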
Scenario 9: Partition to divide the results into equal groups
Let's say we want to divide employees into 4 equal (or almost equal) groups based on their salary.
We will then derive another logical column, tile_id, which holds the numerical group ID of each employee.
Groups are assigned by salary: the first tile contains the highest salaries, and so on.
select employee_id, salary,
ntile(4) over (order by salary desc) as tile_id
from employees;
Partition column = no partition: the entire table is in the same partition
Order column = salary
Sort function = ntile()
employee_id | salary | tile_id |
---|---|---|
4 | 12,500 | 1 |
11 | 11,000 | 1 |
3 | 10,500 | 1 |
1 | 9,000 | 2 |
8 | 8,500 | 2 |
6 | 8,000 | 2 |
12 | 7,000 | 3 |
5 | 7,000 | 3 |
9 | 6,500 | 3 |
10 | 6,000 | 4 |
2 | 5,000 | 4 |
7 | 4,000 | 4 |
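A small SQLite sketch of ntile() with eight invented salaries, which deal out evenly into four buckets of two:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table employees (employee_id integer, salary integer)")
conn.executemany(
    "insert into employees values (?, ?)",
    [(1, 12000), (2, 11000), (3, 9000), (4, 8500),
     (5, 7000), (6, 6500), (7, 6000), (8, 4000)],
)

# ntile(4) deals the salary-ordered rows into 4 near-equal buckets;
# the highest salaries land in tile 1.
rows = conn.execute(
    "select employee_id, salary, "
    "ntile(4) over (order by salary desc) as tile_id "
    "from employees"
).fetchall()
tiles = [t for _, _, t in rows]
print(tiles)  # [1, 1, 2, 2, 3, 3, 4, 4]
```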
Scenario 10: Partition to identify islands or gaps in data
Let's say we have a sequential product_id column and we want to identify gaps in it.
We will then derive another logical column, island_id, which keeps the same value while product_id stays sequential. Whenever a gap in product_id is found, island_id increases.
select product_id,
row_number() over (order by product_id) as row_num,
product_id - row_number() over (order by product_id) as island_id
from products;
Partition column = no partition: the entire table is in the same partition
Order column = product_id
Sort function = row_number()
product_id | row_num | island_id |
---|---|---|
1 | 1 | 0 |
2 | 2 | 0 |
4 | 3 | 1 |
5 | 4 | 1 |
6 | 5 | 1 |
8 | 6 | 2 |
9 | 7 | 2 |
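The gaps-and-islands trick can be checked with SQLite on a small invented id sequence (gaps after 2 and after 6):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table products (product_id integer)")
# Gaps deliberately placed after 2 and after 6.
conn.executemany("insert into products values (?)",
                 [(i,) for i in (1, 2, 4, 5, 6, 8, 9)])

# product_id minus its row number stays constant inside a consecutive
# run and jumps at every gap, giving each island its own label.
rows = conn.execute("""
    select product_id,
           row_number() over (order by product_id) as row_num,
           product_id - row_number() over (order by product_id) as island_id
    from products
""").fetchall()
islands = [island for _, _, island in rows]
print(islands)  # [0, 0, 1, 1, 1, 2, 2]
```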
Group By and Partition By are used to solve many problems, such as:
- Summary information: grouping allows you to aggregate data and summarize information in each group.
- Pattern analysis: helps identify patterns or trends within subsets of data, providing insights into various aspects of the data set.
- Statistical analysis: allows the calculation of statistical measures such as averages, counts, maximums, minimums, and other aggregate functions within groups.
- Data cleaning: helps identify duplicates, inconsistencies, or anomalies within groups, making data cleansing and quality improvement more manageable.
- Cohort analysis: useful in cohort-based analysis, tracking and comparing groups of entities over time.
Breathe runs the Help Codes Blog, which mainly deals with SQL cheat sheets. I am a full stack developer and am interested in creating reusable assets.