Let’s not get confused about Microsoft SQL Server vs Microsoft Windows Server. I’m writing here about windows of data when running a query, and the enhancements in SQL Server 2022 (because that’s the topic chosen by Glenn Berry (@GlennAlanBerry) this month)
Arguably, SQL has always had windows of data. Most expressions apply to a single row of values at a time – if I write DATEDIFF(day,DateTimeCol,sysdatetime()), then this is being applied to the row of data that it’s appearing on. It’s seemingly unaware of anything else. Whereas if I write SUM(Amount) then it’s either referring to the whole data set (if we don’t have a GROUP BY clause), or the set of rows that have the same values in the columns specified by the GROUP BY clause. Back at the turn of the century (ie, before SQL Server 2005), these were three windows we had. The row itself, the group as specified by the GROUP BY clause, or the whole data set. And to use these windows of data with aggregates, the data set would shrink so that it was a single row per window – either a dataset with a single row (without the GROUP BY), or a row per distinct value in the columns and expressions specified in the GROUP BY clause. It’s hard to think of these as windows of data at all.
Then SQL Server 2005 came out and we were introduced to the OVER clause, and windowing functions like ROW_NUMBER(), RANK(), DENSE_RANK(), and NTILE(). The OVER clause let us defined windows by using PARTITION BY, and for those new windowing functions we could do so by using ORDER BY within the OVER clause. We could also use our old aggregate functions on separate windows of data without changing the number of rows in the dataset. We could write “COUNT(*) OVER ()” to have an extra column in our dataset telling us how many rows we had. It would be the same value all the way down, but it was tremendously useful, and by using PARTITION BY it would be the number of rows within that window. We found lots of reasons to love the things we could do with the OVER clause. I gave many a conference session around Australia about them.
SQL Server 2012 brought us even more windowing functions, such as LAG and LEAD, which simplified things a lot. We got the ability to include an ORDER BY clause to aggregates to give us rolling aggregates. And we’d start to have code like:
1 2 3 4 5 |
SELECT COUNT(*) OVER (ORDER BY OrderDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RollingSum, SUM(Amount) OVER (ORDER BY OrderDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RollingSum, AVG(Amount) OVER (ORDER BY OrderDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RollingAvg FROM dbo.Orders; |
This was fine, but it did start to become a little cumbersome.
Enter SQL Server 2022. Not only do we get the ability to ignore nulls now, making it easy to get the last non-null value from a list, but we also get a WINDOW clause – part of the SELECT query itself, dropping in between the HAVING clause and the ORDER BY clause, allowing us to predefine those OVER clause segments.
So now we can write:
1 2 3 4 5 6 |
SELECT COUNT(*) OVER ByRollingDate AS RollingSum, SUM(Amount) OVER ByRollingDate AS RollingSum, AVG(Amount) OVER ByRollingDate AS RollingAvg FROM dbo.Orders WINDOW ByRollingDate (ORDER BY OrderDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW); |
And we can still modify them further by adding extra sections:
1 2 3 4 5 6 7 8 9 10 11 |
SELECT COUNT(*) OVER ByRollingDate AS RollingSum, SUM(Amount) OVER ByRollingDate AS RollingSum, AVG(Amount) OVER ByRollingDate AS RollingAvg, COUNT(*) OVER ByRollingDatePerYear AS RollingSumByYear, SUM(Amount) OVER ByRollingDatePerYear AS RollingSumByYear, AVG(Amount) OVER ByRollingDatePerYear AS RollingAvgByYear FROM dbo.Orders WINDOW ByRollingDate (ORDER BY OrderDate ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), ByRollingDatePerYear (ByRollingDate PARTITION BY Year(OrderDate)); |
I see this as really significant not only because the code becomes tidier, but because of how we consider the execution of the query. When someone writes a query, they don’t always give much consideration to the kinds of things that will be involved in running the query. By taking the the WINDOW definitions out of the SELECT clause, I think it helps the query writer appreciate the work that’s needed. And to anticipate the kind of operators they’ll see in the query plan.
When we write a WHERE clause, we understand there’s filtering to do, maybe using an Index Seek or a Filter. With JOINs, we can expect to see joining operators. With a GROUP BY clause, we understand that there will probably be aggregates, either Stream Aggregates on ordered data or Hash Aggregates on non-ordered data. I think we often feel like the expressions in the SELECT clause are worked out at the end of execution, but this isn’t really how it is. I’m sure you know that if you drop SELECT * into the mix, some very different index choices are probably going to be made. And if you drop an OVER clause in? Well, a bunch of work will be done to order and segment the data. This could be a very expensive piece of work. And if you have different windows defined slightly differently, that might cause a bunch more work to be done when executing that query.
I feel like defining windows in their own section of the query will help promote responsible coding. It will help people be more responsible with the OVER clause, and of course, write queries that simply look more elegant.
I think it’s fair to say I’m a fan of this new feature.
@robfarley.com@bluesky (previously @rob_farley@twitter)
This Post Has 2 Comments
Amazing article. I am working with SQL Server since 2005, teaching other developers SQL, also about windows, but never had so deep understanding of this idea up to this day. Thank you very much! Totally agree, and I’m a fan also.
I’m pleased it helped, Gerard.