Window functions are key to writing SQL code that is efficient and easy to understand. Knowing how they work and when to use them will unlock new ways to solve your reporting problems.
The goal of this article is to explain window functions in SQL step by step in an understandable way so that you don't have to rely solely on memorizing the syntax.
Here's what we'll cover:
- An explanation of how to think about window functions.
- Many examples of increasing difficulty.
- A specific real-world scenario to put our learnings into practice.
- A recap of what we have learned.
Our data set is simple: six rows of revenue data for two regions in the year 2023.
If we take this data set and run a GROUP BY
sum over the revenue of each region, it's clear what happens, right? Only two rows would remain, one for each region, each with the sum of its revenue:
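To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module (SQLite has supported window functions since version 3.25). The table layout follows the article, but the six rows of figures are hypothetical stand-ins for the original data set:

```python
import sqlite3

# Hypothetical stand-in for the article's six-row sales table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INT, date TEXT, region TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "2023-01-15", "North", 100.0), (2, "2023-02-15", "North", 200.0),
    (3, "2023-03-15", "North", 300.0), (4, "2023-01-15", "South", 150.0),
    (5, "2023-02-15", "South", 250.0), (6, "2023-03-15", "South", 350.0),
])

# GROUP BY collapses the six rows down to one row per region.
grouped = con.execute(
    "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(grouped)  # [('North', 600.0), ('South', 750.0)]
```

Note how the six original rows are gone: the aggregation replaced them with one summary row per region.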
The way I want you to think about window functions is very similar to this, but instead of reducing the number of rows, the aggregation runs "in the background" and the values are appended to our existing rows.
First, an example:
SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER () as total_revenue
FROM
sales
Notice that we don't have any GROUP BY
, and our data set is left intact, yet we were still able to get the sum of all the revenue. Before we dive deeper into how this worked, let's quickly go over the full syntax.
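You can check the contrast with the GROUP BY version yourself; below is a sketch using Python's sqlite3 with a hypothetical six-row sales table matching the article's column names:

```python
import sqlite3

# Hypothetical stand-in for the article's six-row sales table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INT, date TEXT, region TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "2023-01-15", "North", 100.0), (2, "2023-02-15", "North", 200.0),
    (3, "2023-03-15", "North", 300.0), (4, "2023-01-15", "South", 150.0),
    (5, "2023-02-15", "South", 250.0), (6, "2023-03-15", "South", 350.0),
])

# The window version keeps all six rows and repeats the grand total on each.
rows = con.execute(
    "SELECT id, revenue, SUM(revenue) OVER () AS total_revenue "
    "FROM sales ORDER BY id"
).fetchall()
```

All six rows survive, and every one of them carries the same `total_revenue` (1350.0 with these made-up figures).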
The syntax is as follows:
SUM(some_column) OVER (PARTITION BY some_columns ORDER BY some_columns)
Breaking down each section, this is what we have:
- An aggregation or window function: SUM, AVG, MAX, RANK, FIRST_VALUE
- The OVER keyword, which says this is a window function
- The PARTITION BY section, which defines the groups
- The ORDER BY section, which defines whether it is a running function (we'll cover this later)
Don't worry about what each of these means just yet; it will become clear when we go over the examples. For now, just know that to define a window function we use the OVER
keyword, and as we saw in the first example, that is the only requirement.
Moving on to something really useful, we will now apply a grouping in our function. I'll keep the initial calculation to show that we can execute more than one window function at a time, which means we can perform different aggregations in the same query, without needing subqueries.
SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER (PARTITION BY region) as region_total,
SUM(revenue) OVER () as total_revenue
FROM sales
As mentioned, we use PARTITION BY
to define the groups (windows) that our aggregation function operates on. So, keeping our data set intact, we now have:
- The total income of each region.
- The total revenue for the entire data set.
We are also not restricted to a single group. Similar to GROUP BY
, we can partition our data by region and quarter, for example:
SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER (PARTITION BY
region,
date_trunc('quarter', date)
) AS region_quarterly_revenue
FROM sales
In the result we see that the only two data points for the same region and quarter were grouped together.
At this point, I hope it's clear how we can see this as doing a GROUP BY
in place, without reducing the number of rows in our data set. Of course, we don't always want that, but it's not uncommon to see queries where someone groups data and then joins it back into the original data set, complicating what could be a single window function.
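That "group, then join back" pattern and its window-function replacement can be compared directly. Here's a sketch with Python's sqlite3 and hypothetical figures showing that the two queries produce identical results:

```python
import sqlite3

# Hypothetical stand-in for the article's six-row sales table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INT, date TEXT, region TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "2023-01-15", "North", 100.0), (2, "2023-02-15", "North", 200.0),
    (3, "2023-03-15", "North", 300.0), (4, "2023-01-15", "South", 150.0),
    (5, "2023-02-15", "South", 250.0), (6, "2023-03-15", "South", 350.0),
])

# The long way: aggregate in a subquery, then join back to the original rows.
via_join = sorted(con.execute("""
    SELECT s.id, s.revenue, t.region_total
    FROM sales s
    JOIN (SELECT region, SUM(revenue) AS region_total
          FROM sales GROUP BY region) t
      ON s.region = t.region
""").fetchall())

# The short way: a single window function, no subquery needed.
via_window = sorted(con.execute("""
    SELECT id, revenue, SUM(revenue) OVER (PARTITION BY region)
    FROM sales
""").fetchall())

assert via_join == via_window  # same rows, same totals
```

Both return all six rows with the region total attached; the window version just says it in one pass.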
Moving on to the ORDER BY
keyword: this defines a running window function. You've probably heard of a running sum at some point, but if not, let's start with an example to make everything clear.
SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER (ORDER BY id) as running_total
FROM sales
What's happening here is that we go row by row, adding each row's revenue to the sum of all previous values. This was done following the order of the id
column, but it could have been any other column.
This specific example is not particularly useful, because we are adding up random months across two regions, but using what we've learned we can now find the cumulative revenue per region. We do this by applying the running sum within each group.
SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER (PARTITION BY region ORDER BY date) as running_total
FROM sales
Please take the time to make sure you understand what happened here:
- For each region, we move forward month by month and add up the revenue.
- Once we're done with that region, we move on to the next, starting from scratch and again moving forward through the months.
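The two steps above can be traced with a small sketch in Python's sqlite3, using hypothetical monthly figures:

```python
import sqlite3

# Hypothetical stand-in for the article's six-row sales table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INT, date TEXT, region TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "2023-01-15", "North", 100.0), (2, "2023-02-15", "North", 200.0),
    (3, "2023-03-15", "North", 300.0), (4, "2023-01-15", "South", 150.0),
    (5, "2023-02-15", "South", 250.0), (6, "2023-03-15", "South", 350.0),
])

# Running total restarts at the beginning of each region's partition.
running = con.execute("""
    SELECT region, date,
           SUM(revenue) OVER (PARTITION BY region ORDER BY date) AS running_total
    FROM sales
    ORDER BY region, date
""").fetchall()

totals = [r[2] for r in running]
print(totals)  # [100.0, 300.0, 600.0, 150.0, 400.0, 750.0]
```

Notice the reset: the total climbs to 600.0 for the first region, then drops back to the first month's value when the second region begins.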
It's quite interesting to note that when we write these running functions, we have the "context" of other rows. What I mean is that to get the cumulative sum at a given row, we must know the values of the previous rows. This becomes more obvious when we learn that we can manually choose how many rows before/after the current one we want to include.
SELECT
id,
date,
region,
revenue,
SUM(revenue) OVER (ORDER BY id ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING)
AS useless_sum
FROM
sales
For this query we specified that for each row we wanted to look at one row behind and two rows ahead, meaning we get the sum of that range. Depending on the problem you are solving, this can be extremely powerful as it gives you full control over how you group your data.
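To see the frame clause in action, here is a sketch in Python's sqlite3 with hypothetical figures; for each row, the sum covers the row itself plus one row behind and two ahead:

```python
import sqlite3

# Hypothetical stand-in for the article's six-row sales table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INT, date TEXT, region TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "2023-01-15", "North", 100.0), (2, "2023-02-15", "North", 200.0),
    (3, "2023-03-15", "North", 300.0), (4, "2023-01-15", "South", 150.0),
    (5, "2023-02-15", "South", 250.0), (6, "2023-03-15", "South", 350.0),
])

# Each row sums a sliding frame: 1 row before through 2 rows after.
frame = con.execute("""
    SELECT id, SUM(revenue) OVER (
        ORDER BY id ROWS BETWEEN 1 PRECEDING AND 2 FOLLOWING
    ) AS useless_sum
    FROM sales ORDER BY id
""").fetchall()

print([r[1] for r in frame])  # [600.0, 750.0, 900.0, 1050.0, 750.0, 600.0]
```

At the edges the frame simply shrinks: the first row has no preceding row to include, and the last row has no following ones.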
Finally, one last function I want to mention before moving on to a more complicated example is the RANK
function. This is asked a lot in interviews and the logic behind it is the same as everything we have learned so far.
SELECT
*,
RANK() OVER (PARTITION BY region ORDER BY revenue DESC) as rank,
RANK() OVER (ORDER BY revenue DESC) as overall_rank
FROM
sales
ORDER BY region, revenue DESC
As before, we use ORDER BY
to specify the order in which we will walk, row by row, and PARTITION BY
to specify our subgroups.
The first calculation ranks the rows within each region, meaning we will have multiple rank 1s in the data set. The second ranks across all rows of the data set.
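The two rankings can be verified side by side; a sketch in Python's sqlite3 with hypothetical figures:

```python
import sqlite3

# Hypothetical stand-in for the article's six-row sales table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (id INT, date TEXT, region TEXT, revenue REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "2023-01-15", "North", 100.0), (2, "2023-02-15", "North", 200.0),
    (3, "2023-03-15", "North", 300.0), (4, "2023-01-15", "South", 150.0),
    (5, "2023-02-15", "South", 250.0), (6, "2023-03-15", "South", 350.0),
])

# region_rank restarts per region; overall_rank spans the whole data set.
ranks = con.execute("""
    SELECT region, revenue,
           RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS region_rank,
           RANK() OVER (ORDER BY revenue DESC) AS overall_rank
    FROM sales
    ORDER BY region, revenue DESC
""").fetchall()

print([r[2] for r in ranks])  # [1, 2, 3, 1, 2, 3] -- a rank 1 per region
print([r[3] for r in ranks])  # [2, 4, 6, 1, 3, 5] -- one global ordering
```

With these made-up numbers the two regions interleave in the overall ranking, while each region still gets its own 1-2-3 sequence.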
Our next problem appears from time to time, and solving it in SQL requires extensive use of window functions. To explore it, we will use a different data set containing timestamps and temperature measurements. Our goal is to fill in the rows that are missing temperature readings with the last measured value.
This is what we hope to have in the end:
Before we start, I just want to mention that if you are using Pandas, you can solve this problem by simply running df.ffill()
, but in SQL the problem becomes a little more complicated.
The first step is to somehow group the NULLs with the previous non-null value. It may not be obvious how to do this, but I hope it is clear that it will require a running function: a function that "walks row by row," knowing when we reach a null value and when we reach a non-null one.
The solution is to use COUNT
, and more specifically, to count the temperature values. In the following query, I run both a normal running count and a count of only the temperature values.
SELECT
*,
COUNT(*) OVER (ORDER BY timestamp) as normal_count,
COUNT(temperature) OVER (ORDER BY timestamp) as group_count
from sensor
- In the first calculation, we simply count every row cumulatively.
- In the second, we count all the temperature values seen so far, skipping the NULLs.
The normal_count
column is useless for us; I just wanted to show what a running COUNT
looks like. Our second calculation, group_count
, however, brings us closer to a solution!
Note that this way of counting picks up each value right before the NULLs start, and then, every time the function sees a null value, nothing happens. This ensures that we are "labeling" every subsequent null with the same count we had when the measurements stopped.
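The labeling step can be traced with a small sketch in Python's sqlite3; the sensor readings below are hypothetical, chosen so that group 2's measurement is 15.0 as in the article's walkthrough:

```python
import sqlite3

# Hypothetical sensor readings with gaps (NULL temperatures).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sensor (timestamp INT, temperature REAL)")
con.executemany("INSERT INTO sensor VALUES (?, ?)", [
    (1, 20.0), (2, None), (3, None), (4, 15.0), (5, None), (6, 18.0),
])

# COUNT(*) counts every row; COUNT(temperature) skips the NULLs,
# so its value only moves when a real measurement appears.
counts = con.execute("""
    SELECT timestamp, temperature,
           COUNT(*) OVER (ORDER BY timestamp) AS normal_count,
           COUNT(temperature) OVER (ORDER BY timestamp) AS group_count
    FROM sensor ORDER BY timestamp
""").fetchall()

print([r[3] for r in counts])  # [1, 1, 1, 2, 2, 3] -- each NULL inherits its group
```

Each measurement bumps the count, and the NULLs that follow it keep that same number, which is exactly the group label we need.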
Moving on, we now need to copy the first value that was labeled to all the other rows within that same group, which means that for group 2
, every row must be filled with the value 15.0
.
Can you think of a function we could use here? There is more than one answer, but again, I hope it is at least clear that we are now looking at a simple window aggregation with PARTITION BY
.
SELECT
*,
FIRST_VALUE(temperature) OVER (PARTITION BY group_count ORDER BY timestamp) as filled_v1,
MAX(temperature) OVER (PARTITION BY group_count) as filled_v2
FROM (
SELECT
*,
COUNT(temperature) OVER (ORDER BY timestamp) as group_count
from sensor
) as grouped
ORDER BY timestamp ASC
We can use either FIRST_VALUE
or MAX
to achieve what we want; the only goal is to obtain the first non-null value. Since we know that each group contains one non-null value followed by a bunch of NULLs, both functions work!
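The full solution can be run end to end; below is a sketch in Python's sqlite3 with the same hypothetical sensor readings (FIRST_VALUE is given an ORDER BY so that "first" is well defined):

```python
import sqlite3

# Hypothetical sensor readings with gaps (NULL temperatures).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sensor (timestamp INT, temperature REAL)")
con.executemany("INSERT INTO sensor VALUES (?, ?)", [
    (1, 20.0), (2, None), (3, None), (4, 15.0), (5, None), (6, 18.0),
])

# Inner query labels each row with its group; outer query copies the
# group's single non-null measurement onto every row of the group.
filled = con.execute("""
    SELECT timestamp, temperature,
           FIRST_VALUE(temperature) OVER (
               PARTITION BY group_count ORDER BY timestamp) AS filled_v1,
           MAX(temperature) OVER (PARTITION BY group_count) AS filled_v2
    FROM (
        SELECT *, COUNT(temperature) OVER (ORDER BY timestamp) AS group_count
        FROM sensor
    ) AS grouped
    ORDER BY timestamp
""").fetchall()

print([r[2] for r in filled])  # [20.0, 20.0, 20.0, 15.0, 15.0, 18.0]
```

Both columns come out identical: every NULL has been replaced by the last measured value before it, which is exactly a forward fill.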
This example is a great way to practice window functions. If you want a similar challenge, try adding two sensors and then populate the values with the previous reading from that sensor. Something like this:
Could you do it? It doesn't use anything we haven't learned here so far.
Now we know everything we need to about how window functions work in SQL, so let's do a quick recap!
This is what we have learned:
- We use the OVER keyword to write window functions.
- We use PARTITION BY to specify our subgroups (windows).
- If we provide only OVER(), our window is the entire data set.
- We use ORDER BY when we want a running function, meaning our calculation progresses row by row.
- Window functions are useful when we want to group data to perform an aggregation while keeping our data set as is.