When do you join on your own? A useful trick | by Saikat Dutta | March 2024

Intermediate SQL for the transition from ETL developer to data engineer

There is nothing called self-join in SQL. Listen to me.

Often, data analysts and engineers need to find patterns in data that are not obvious. However, you can generate insights and identify patterns by using common SQL practices such as Self Join.

Many beginners often have difficulty understanding Self Join. They confuse Self Join in SQL with another command like Inner or Left Join. But Self Join is not a keyword in SQL. It is not a command in SQL either.

The self join is like a normal join (inner/left/right/outer) between two tables. However, in a self-join, the two tables are the same but act as different tables through their aliases.

Self-joining is often considered bad practice in data engineering. They say its use is risky. However, there are situations where using a self-join is practical and the best way to address the problem.

Let's look at some examples:

Hierarchical data:

Self-joins are useful for working with hierarchical data. In an org chart, we can join a table to itself based on manager-employee relationships to find reports for employees, department heads, etc.

Let's generate some free data to test this.

create table employee
(
employee_id int,
employee_name varchar(10),
EmpSSN varchar(11),
manager_id int null,
city varchar(20)
);--Correct data
insert into employee values(1, 'Jack', '555-55-5555','','Kolkata');
insert into employee values (2, 'Joe', '555-56-5555',1,'Kolkata');
insert into employee values (3, 'Fred', '555-57-5555',2,'Dehli');
insert into employee values (4, 'Mike', '555-58-5555',2,'Kolkata');
insert into employee values (5, 'Cathy', '555-59-5555',2,'Dehli');
insert into employee values (6, 'Lisa', '555-70-5555',3,'Bangalore');

The details of an organization's employees along with their manager ID are stored here. We can use Self join to identify the manager of all the different employees.

select emp.*,isnull(mgr.employee_name,'Boss') as managerName from employee emp
left join employee mgr on emp.manager_id = mgr.employee_id

Here the query returns the name of the manager corresponding to each employee when joining the same employee table with itself in manager_id = Employee_id.

Returns admin name using autojoin, image by author

WARNING: Don't forget to use aliases for the employee table, to differentiate between two parts of the self-union. Additionally, the join column must be used correctly.

Similarly, we can even find the different levels of hierarchy by recursively joining the CTE as a self-join with itself.

WITH
EmployeeHierarchy AS (
SELECT
employee_id, employee_name, manager_id, 0 
AS
level
FROM
employee
WHERE
manager_id = 0
UNION ALL
SELECT
emp.employee_id, emp.employee_name, emp.manager_id, eh.level + 1
FROM
employee emp
JOIN
EmployeeHierarchy eh 
ON
emp.manager_id = eh.employee_id
)
SELECT
employee_id, employee_name, level
FROM
EmployeeHierarchy;

Products and categories:

This may be related to hierarchical data only, but this is a specific subset. Self-joins can be extremely useful for identifying all combinations of products, categories, and subcategories. In manufacturing this can provide components and sub-components, in e-commerce it can be used to source similar products or categories.

Let's learn it through an example:

Create a table and insert dummy data:

create table bom (item_id int, parent_id int null,description varchar(50), quantity int)INSERT INTO bom (item_id, parent_id, description, quantity)
VALUES (1, NULL, 'Widget (Main Assembly)', 1),
(2, 1, 'Gear A', 2),
(3, 1, 'Spring B', 4),
(4, 2, 'Screw C (Small)', 10),
(5, 2, 'Screw C (Large)', 5),
(6, 3, 'Nut D', 1);

We create a table with columns of item_id, parent_id, description and quantity. We have also inserted sample data from a manufacturing line, where 'Widget (main assembly)' is the main product and gears, bolts, nuts, etc. They are byproducts.

We can use self-join to identify the parent-child relationship, and recursive self-join can identify the entire product sequence.

Let's go over this with the query and results:

WITH recursive_bom AS (
SELECT item_id, parent_id, description, quantity, cast(description as nvarchar(255)) AS full_path
FROM bom
WHERE parent_id IS NULL  -- Starting point: Top-level items
UNION ALL
SELECT
b.item_id,
b.parent_id,
b.description,
b.quantity,
cast(CONCAT(rb.full_path, '.', b.description) as nvarchar(255))  AS full_path
FROM bom b
INNER JOIN recursive_bom rb ON b.parent_id = rb.item_id
)
SELECT item_id, description, quantity, full_path
FROM recursive_bom
ORDER BY full_path;

Sample data and results

Product sequence generation through self-joining. — Author's image

User segmentation:

In Business and Data Analytics, an important aspect is user segmentation. Users are often classified based on their purchasing behavior, their frequency of interaction with the company, etc. A self-join can be a great way to identify these patterns in the transaction data.

Let's consider the following example:

We need to identify returning customers within a specific time period (7 days) for an eCommerce business. Similar challenges can be found on the Internet, e.g. here.

Let's create a test table and insert some sample records into the table.

Tip: You can ask ChatGpt to generate the test data as needed.

Create a table and insert dummy data:

create table ecom_tran (
tranid int,
userid int,
created_date datetime,
itemname varchar(50)
)INSERT INTO ecom_tran (tranid, userid, created_date, itemname, price_paid)
VALUES
(1, 201, '2024-02-23 11:45:00', 'Running Shoes'),
(2, 202, '2024-02-24 10:00:00', 'Yoga Mat'),
(3, 203, '2024-02-26 14:10:00', 'Water Bottle'),
(4, 204, '2024-02-27 09:30:00', 'Gym Bag'),
(5, 205, '2024-02-28 12:00:00', 'Protein Powder'),
(6, 201, '2024-02-29 15:15:00', 'Phone Case'),
(7, 206, '2024-03-01 10:45:00', 'Webcam'),
(8, 202, '2024-03-02 16:30:00', 'Pen Drive'),
(9, 207, '2024-03-04 12:00:00', 'Powerbank'),
(10, 203, '2024-03-05 09:00:00', 'Monitor'),
(11, 101, '2024-03-06 11:00:00', 'Mouse'),
(12, 102, '2024-03-07 14:45:00', 'Speaker'),  
(13, 103, '2024-03-08 10:10:00', 'Tablet'),
(14, 101, '2024-03-09 13:30:00', 'Headphones'),  
(15, 104, '2024-03-10 17:00:00', 'Book'),
(16, 102, '2024-03-11 08:20:00', 'Coffee Maker'), 
(17, 105, '2024-03-12 11:15:00', 'Smartwatch'),
(18, 101, '2024-03-13 15:45:00', 'Shirt'),  
(19, 103, '2024-03-14 12:30:00', 'Laptop')

Solution Approach:

In the sample table created, we have the column User ID, Transaction ID and Creation Date that are relevant to the challenge. Since we were asked to identify users who made at least 2 purchases in a 7-day period, we can consider the following approach:

Check how many different transactions users have made.
Combine each transaction with itself to identify all possible pairs of transactions made by the same user.
Calculate the difference in dates between the two combinations.
The date difference must be > 0 and < 7. This will ensure that only records where transactions occurred within 7 days are returned.
We may collect various user IDs to identify users who have returned transactions within 7 days.

This is a classic use case to consider autojoin along with non-equivalent join.

SELECT a.userid,
a.tranid AS id1,
a.created_date AS created_at1,
b.tranid AS id2,
b.created_date AS created_at2,
mod(DATEDIFF(dd,a.created_date,b.created_date))
FROM ecom_tran a
JOIN ecom_tran b 
ON a.userid=b.userid
AND a.tranid <> b.tranid
ORDER BY a.userid

The above query outputs all combinations of transactions made by the same users. We have achieved this by joining ecom_tran with itself, with the help of aliases, in the user ID column. This inner join ensures that only transactions from the same user are returned.

However, non-equivalent joining is enabled. a.tranid b.tranid This ensures that the same transactions are not repeated.

We also calculate the date difference between the two transactions.

Now, if we only filter out those where the date difference between the creation dates is > 0 and < 7, that will give us all transactions that occurred within 7 days of the same user. We can take a different value from the User ID column to simply identify users who made recurring purchases within 7 days.

Example of self-joining in user segmentation. — Author's image

Conclusion:

I hope you have a general understanding and intuition about how autojoin works in SQL. Although self-joins are not very intuitive to understand and use, there are specific use cases where they are indispensable.

I have covered just a few of the possible scenarios. However, this is enough to give you confidence when facing any SQL interview question. Even if the question may need an intuitive understanding to solve the challenges, these concepts will help identify which approach to use.

When do you join on your own? A useful trick | by Saikat Dutta | March 2024

Technical Terrence Team

SpaceX investor explains how to invest in space

Leave a Reply Cancel reply

Recommended.

The Verge’s 2023 home tech holiday gift guide

Elon Musk admits Twitter has too many ads, says a fix is coming – TechCrunch

Deciphering clinical abbreviations with privacy-protecting ML – Google AI Blog

The Royal Caribbean singer shares what she does on her day off

ONTO Wallet partners with Fiat24 to offer NFT-based digital banking services

Categories

Important Links

When do you join on your own? A useful trick | by Saikat Dutta | March 2024

Intermediate SQL for the transition from ETL developer to data engineer

Hierarchical data:

Products and categories:

User segmentation:

Related

Technical Terrence Team

SpaceX investor explains how to invest in space

Leave a Reply Cancel reply

Recommended.

The Verge’s 2023 home tech holiday gift guide

Elon Musk admits Twitter has too many ads, says a fix is ​​coming – TechCrunch

Deciphering clinical abbreviations with privacy-protecting ML – Google AI Blog

The Royal Caribbean singer shares what she does on her day off

ONTO Wallet partners with Fiat24 to offer NFT-based digital banking services

Categories

Important Links

Get daily news updates to your inbox!

Elon Musk admits Twitter has too many ads, says a fix is coming – TechCrunch