Some time ago I was writing some windowing functions on a set of data. Basically I was looking for the last date an event had occurred for each type of event. Let's illustrate with an example:
```sql CREATE TABLE dbo.TestOver (ID INT IDENTITY(1,1) PRIMARY KEY NOT NULL ,[Group] CHAR(1) NOT NULL ,Value INT NOT NULL);
INSERT INTO dbo.TestOver([Group],Value) VALUES (‘A’,1) ,(‘A’,2) ,(‘A’,3) ,(‘A’,4) ,(‘B’,5) ,(‘B’,6) ,(‘B’,7) ,(‘B’,8) ,(‘B’,9); ```
Using the data above, I need to find the value 4 for group A and the value 9 for group B. I first wrote the following T-SQL statement to retrieve the data:
SELECT DISTINCT [Group], MAX(Value) OVER (PARTITION BY [Group] ORDER BY Value)
The results are of course incorrect. A little baffled why this was the cause, I changed the ORDER BY to descending which gave me the results I wanted.
I really didn't think twice over this, until I joined the session Powerful T-SQL Improvements that Reduce Query Complexity by Hugo Kornelis (blog | twitter) on the SQL Server Days. I learned two things.
- You don't need to specify the ORDER BY.
In SQL Server 2005, the OVER clause was introduced and it simplified some aggregations like the one we're doing here. When using the ranking window functions the ORDER BY clause is mandatory, but when using a regular aggregate window function the ORDER BY clause is not allowed. This gives us the following T-SQL which is the perfect solution for our problem here:
SELECT DISTINCT [Group], MAX(Value) OVER (PARTITION BY [Group])
To be honest, I completely forgot aggregate functions could be used this way. The PARTITION BY clause is optional as well, so you can have a completely empty OVER clause.
- When you do specify the ORDER BY, defaults come into play.
Starting from SQL Server 2012, the T-SQL windowing functions and the OVER clause were greatly enhanced. Suddenly you can specify an ORDER BY for the aggregate windowing functions (which I did in the first attempts, remember?). However, if you specify an ORDER BY clause but no ROW or RANGE clause, SQL Server will apply the following defaults: RANGE UNBOUNDED PRECEDING as the lower limit and CURRENT ROW for the upper limit of the window. When Hugo explained this, I had my “Eureka” moment (or rather my “How could I have been this stupid?” moment). Because of these defaults, the MAX aggregate was calculated over the wrong windows! Let's illustrate the concept for group A:
Because of the defaults, the first window is limited to only one row. This means the MAX aggregate will return the value 1. In the second window, two rows are included and MAX will return 2 and so on. By reversing the sort order, the value 4 will always be included in the windows, so MAX will return the correct answers. However dropping the ORDER BY is in my opinion the cleanest option to solve the problem.
Learn your T-SQL syntax and be aware of the defaults! Hugo also mentioned that ROWS
might will have better performance than RANGE, so you better always specify your window frames to avoid the default.
Update: I was contacted by the amazing Rob Farley who told me that ROWS will beat RANGE any day of the week and that you should always specify ROWS unless you really need RANGE.