In this co-authored blog post, Naomi and I will present six different solutions to the commonly experienced query problem of how to include an aggregated column’s related values – values from the same row in the group that provided the displayed value.
Background
It’s a fairly simple query to select the maximum date for each group in a GROUP BY query. You just throw in a Max() on the date column and then GROUP BY the other columns. For example,
T-SQL | |
1 2 3 4 5 | SELECT CustomerID, LastOrderDate = Max(OrderDate) FROM Orders GROUP BY CustomerID |
SELECT CustomerID, LastOrderDate = Max(OrderDate) FROM Orders GROUP BY CustomerID
But a common query need is to include other column values from the same row as the most recent date. Unfortunately, putting aggregates on other columns doesn’t work:
T-SQL | |
1 2 3 4 5 6 | SELECT CustomerID, LastOrderDate = Max(OrderDate), LastSubTotal = Max(SubTotal) -- wrong: won't be the subtotal from Max(OrderDate), but the greatest order total ever placed by this customer. FROM Orders GROUP BY CustomerID |
SELECT CustomerID, LastOrderDate = Max(OrderDate), LastSubTotal = Max(SubTotal) -- wrong: won't be the subtotal from Max(OrderDate), but the greatest order total ever placed by this customer. FROM Orders GROUP BY CustomerID
It’s almost like you need some kind of Max(OrderDate->SubTotal).
Note that the desired results are fairly easy in MS Access using the aggregate functions Last and First:
T-SQL | |
1 2 3 4 5 6 7 | SELECT CustomerID, LastOrderDate = Last(OrderDate), SubTotal = Last(SubTotal) FROM Orders GROUP BY CustomerID ORDER BY OrderDate |
SELECT CustomerID, LastOrderDate = Last(OrderDate), SubTotal = Last(SubTotal) FROM Orders GROUP BY CustomerID ORDER BY OrderDate
What happens in Access is that the Last() aggregate combines with the ORDER BY clause to select values from the row containing the most recent Order Date.
However, these functions are not available in SQL Server, likely because the query engine only does ordering at the very end, after GROUP BY and HAVING are processed. While this may seem like a defect, it is really a deliberate design trade-off, with compensations elsewhere in the engine.
Test Case
In order to have enough data to make results significant and also to let others run the same queries against the same data, we are going to use the AdventureWorks sample database.
In our scenario, we want to show customers’ account numbers along with the date and subtotal of their most recent order. We’re keeping it simple for clarity, but note that you can include as many columns as you like. In a later installment we’ll do these same queries with more columns and different scenarios, and will expand on some of the queries a bit. We’ll also discuss performance to help you select between them.
Note that the aggregate doesn’t have to be on a date. Perhaps you want to know the date of the largest dollar value order per customer. In this case, you’d be doing a Max() on the order value instead of a date, and you would most certainly run into duplicates. Keep this in mind when reading below and when developing your own queries.
The basic query we’ll be working with is:
T-SQL | |
1 2 3 4 5 6 | SELECT C.AccountNumber, LastOrderDate = Max(O.OrderDate) FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID |
SELECT C.AccountNumber, LastOrderDate = Max(O.OrderDate) FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID
What we want to do is also return the Subtotal value from the same row as the LastOrderDate.
Some Approaches to the Problem
We know of only a few basic approaches to the problem:
– Number and Filter — select all rows from OrderHeader, and number them in a way exploitable to correctly filter out unwanted rows. Observe that this won’t require hitting the table again, but it could use a lot of extra resources to temporarily materialize more data. Skipping over obvious poor choices such as a temp table with an ordered update or a cursor, options might be a windowing function (SQL 2005 and up) or a temp table with an identity column (SQL 2000).
– Simulate MS Access — simulate how MS Access works somehow, getting only the data we need in a single table access (seek or scan). Let’s imagine how the SQL engine already performs a Max. As it touches each row, it must store the intermediate value that is the best candidate so far for all the seen values. And it has to have one of these per group, per column. All that is needed is that every time it writes the Max() candidate into its temporary storage, it also writes the extra column we want into the same data.
How could we do this ourselves? We need to ORDER BY OrderDate but actually yield SubTotal. SubTotal clearly won’t sort in the correct order by itself. Remembering Max(OrderDate->SubTotal), what if it wasn’t by itself, and somehow contained the OrderDate, too, to make it sort properly? So here we have an idea about packing both columns into a single value that sorts correctly yet can still be used to get the SubTotal back out. When the engine does its Max() bit as usual on the date and stores the best candidate, it will also be forced to store the other value at the same time (which we can extract later).
So let’s use those ideas and turn them into real queries. Unless otherwise stated, queries will work with SQL Server 2000.
Key Search
1. Correlated Subquery
- Relies on unique order dates to avoid selecting multiple rows per customer.
- Correlated subqueries perform well with small datasets and badly with large datasets (though better with SQL 2005 and 2008 because they can sometimes get converted to real joins).
- Performance degrades geometrically as rows increase.
- Does not handle NULLs correctly.
T-SQL | |
1 2 3 4 5 6 7 8 9 10 11 12 13 | SELECT C.AccountNumber, O.OrderDate, O.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID WHERE O.OrderDate = ( SELECT Max(OrderDate) FROM SalesOrderHeader O2 WHERE O2.CustomerID = O.CustomerID ) |
SELECT C.AccountNumber, O.OrderDate, O.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID WHERE O.OrderDate = ( SELECT Max(OrderDate) FROM SalesOrderHeader O2 WHERE O2.CustomerID = O.CustomerID )
Another version of correlated subquery which sometimes performs much better is
T-SQL | |
1 2 3 4 5 6 7 8 9 10 11 12 13 | SELECT C.AccountNumber, O.OrderDate, O.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID WHERE O.OrderID = ( SELECT top 1 OrderID FROM SalesOrderHeader O2 WHERE O2.CustomerID = O.CustomerID order by O2.OrderDate Desc, O2.OrderID DESC) |
SELECT C.AccountNumber, O.OrderDate, O.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID WHERE O.OrderID = ( SELECT top 1 OrderID FROM SalesOrderHeader O2 WHERE O2.CustomerID = O.CustomerID order by O2.OrderDate Desc, O2.OrderID DESC)
Another variation of this query is (suggested by Alejandro Mesa (Hunchback) in this MSDN thread):
T-SQL | |
1 2 3 4 5 6 7 8 | SELECT C.AccountNumber, O1.OrderDate, O1.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O1 ON C.CustomerID = O1.CustomerID WHERE not exists (select 1 from Sales.SalesOrderHeader O2 where O2.CustomerID = O1.CustomerID and O1.OrderDate < O2.OrderDate) |
SELECT C.AccountNumber, O1.OrderDate, O1.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O1 ON C.CustomerID = O1.CustomerID WHERE not exists (select 1 from Sales.SalesOrderHeader O2 where O2.CustomerID = O1.CustomerID and O1.OrderDate < O2.OrderDate)
Note, that the first variation of this query may return duplicate rows, but the second will always return only one row per group (with the latest OrderID).
The third variation of this query with NOT EXISTS will produce duplicates in case of the same date, but can also be easily adjusted to return only one row with the max OrderID:
T-SQL | |
1 2 3 4 5 6 7 8 9 | SELECT C.AccountNumber, O1.OrderDate, O1.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O1 ON C.CustomerID = O1.CustomerID WHERE not exists (select 1 from Sales.SalesOrderHeader O2 where O2.CustomerID = O1.CustomerID and (O1.OrderDate < O2.OrderDate OR (O1.OrderDate = O2.OrderDate and O1.OrderID < O2.OrderID))) |
SELECT C.AccountNumber, O1.OrderDate, O1.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O1 ON C.CustomerID = O1.CustomerID WHERE not exists (select 1 from Sales.SalesOrderHeader O2 where O2.CustomerID = O1.CustomerID and (O1.OrderDate < O2.OrderDate OR (O1.OrderDate = O2.OrderDate and O1.OrderID < O2.OrderID)))
2. Derived Table
- Relies on unique order dates to avoid selecting multiple rows per customer.
- Performs better than the correlated subquery when querying a significant portion of all customers, but worse when only querying a few, since the last order date for every customer is always calculated.
- Performance degrades linearly as rows increase.
- Does not handle NULLs correctly.
T-SQL | |
1 2 3 4 5 6 7 8 9 10 11 12 | SELECT C.AccountNumber, O.OrderDate, O.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID INNER JOIN ( SELECT CustomerID, LastOrderDate = Max(OrderDate) FROM Sales.SalesOrderHeader GROUP BY CustomerID ) O2 ON O.CustomerID = O2.CustomerID AND O.OrderDate = O2.LastOrderDate |
SELECT C.AccountNumber, O.OrderDate, O.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID INNER JOIN ( SELECT CustomerID, LastOrderDate = Max(OrderDate) FROM Sales.SalesOrderHeader GROUP BY CustomerID ) O2 ON O.CustomerID = O2.CustomerID AND O.OrderDate = O2.LastOrderDate
Number and Filter
Windowing Functions
The windowing functions were introduced in SQL Server 2005, so the two queries below will not work in SQL Server 2000.
- Will always return one row per customer. Using RANK() instead of ROW_NUMBER() will yield the same results as queries 1 and 2.
- Easily supports not just 1 but n rows per group.
- Performance degrades fairly linearly as rows increase. The windowing functions are fast but huge rowsets can begin to bog them down.
3. Windowing Function – Two Stages
- A common table expression is not needed, but is cleaner. The CTE query can be made into a derived table if desired.
T-SQL | |
1 2 3 4 5 6 7 8 9 10 11 12 13 | WITH OrderData AS ( SELECT C.AccountNumber, O.OrderDate, O.SubTotal, Selector = ROW_NUMBER() OVER (PARTITION BY O.CustomerID ORDER BY O.OrderDate DESC) FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID ) SELECT AccountNumber, OrderDate, SubTotal FROM OrderData WHERE Selector = 1 |
WITH OrderData AS ( SELECT C.AccountNumber, O.OrderDate, O.SubTotal, Selector = ROW_NUMBER() OVER (PARTITION BY O.CustomerID ORDER BY O.OrderDate DESC) FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID ) SELECT AccountNumber, OrderDate, SubTotal FROM OrderData WHERE Selector = 1
4. Windowing Function – One Stage
T-SQL | |
1 2 3 4 5 6 7 8 9 | SELECT TOP 1 WITH TIES C.AccountNumber, O.OrderDate, O.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID ORDER BY ROW_NUMBER() OVER (PARTITION by O.CustomerID ORDER BY OrderDate DESC) |
SELECT TOP 1 WITH TIES C.AccountNumber, O.OrderDate, O.SubTotal FROM Sales.Customer C INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID ORDER BY ROW_NUMBER() OVER (PARTITION by O.CustomerID ORDER BY OrderDate DESC)
Simulate MS Access
5. Compound Key, aka Packed Values
- Uses only one table access with a simple GROUP BY.
- Naomi first learned this idea from a Russian SQL FAQ.
- Erik cooked up a similar idea in 2008.
- This example uses a varchar column to hold the packed values, but be aware that converting to and from binary is faster than char (plus harder to use and more error-prone) and binary aggregation is faster. The crucial part is being able to faithfully extract the data that was put in.
- This query has no problem with duplicates. It does effectively sort by the full compound value, so if multiple orders for the same customer can be on the same order date and you want some other column to determine which one is selected, it has to be included in the packed data.
- As given, this query does not handle NULLs at all. A few strategic Coalesces can fix that.
- There is no significant performance difference between unpacking one many-item compound value or many one-item compound values (though you’d always include all columns needed for ordering).
T-SQL | |
1 2 3 4 5 6 7 8 9 10 11 12 13 | SELECT CustomerID, LastOrderDate, LastSubTotal = Convert(money, Substring(DateAndSubtotal, 20, 25)) FROM ( SELECT CustomerID, LastOrderDate = Max(OrderDate), DateAndSubtotal = Max(convert(varchar(50), OrderDate, 121) + Convert(varchar(50), Subtotal)) FROM Sales.SalesOrderHeader GROUP BY CustomerID ) X |
SELECT CustomerID, LastOrderDate, LastSubTotal = Convert(money, Substring(DateAndSubtotal, 20, 25)) FROM ( SELECT CustomerID, LastOrderDate = Max(OrderDate), DateAndSubtotal = Max(convert(varchar(50), OrderDate, 121) + Convert(varchar(50), Subtotal)) FROM Sales.SalesOrderHeader GROUP BY CustomerID ) X
Isn’t that cool? I don’t advocate using it all the time, but when performance is bad and the trade-offs are worth it, do it.
In closing, thank you for visiting. We hope this is useful to you. You might like to see the next installment Including an Aggregated Column’s Related Values – Part 2, discussing this issue in further depth.
Erik
P.S. I would like to thank Naomi for co-authoring this blog post with me. She did the heavy lifting of actually writing queries and testing their performance; I just fleshed out the explanations a bit.
16 Comments
Excellent post and much needed. Thank you both.
I actually ask this question when I interview people and I can tell you that 95% of the people are NOT able to write any of these 5 versions. They can write a query that gives the max for one column for example but when you tell them that you want all the columns it becomes problematic
Would love to see the performance differences between the 5 queries
Agree, this is a really good post and one that deffinetely gets people. I had to work with my SQL Dev a few times getting over this hill.
Very nice Erik and Naomi!
good read. still haven’t quite wrapped my mind on how row_number works.
Excellent post, in Universal Thread I asked this matter and normally use the CTE approach, but is important to have things clarified as this post does.
Thanks
Great post.
For the Simulate MS Access compound key solution I found that using a style of 126 as in the following:
MAX(CONVERT(VARCHAR(50), OrderDate, 126)
produced inconsistent datetime lengths for the data I had.
I used the following:
MAX(CONVERT(VARCHAR(50), OrderDate, 20) and it worked ok for me.
I would like to thank you for a great post. One of the best I have seen on the internet.
cheers
Mark
Mark,
Thanks for the compliment!
About the lengths, that’s very interesting. I don’t understand how style 126 would give inconsistent lengths, since each value is padded out to its max length. Could you give some examples of values that resulted in different lengths?
I can at least say for certain that using style 20 will not give correct results because it loses precision by truncating the milliseconds portion of the given datetime.
Style 126: 2009-11-10T15:03:01.003
Style 020: 2009-11-10T15:03:01
Hi Emtucifor
Please look at these dates
2009-07-27 12:22:44.537 datetime
2009-07-27T12:22:44.537 Style 126
2005-11-21 00:00:00.000 datetime
2005-11-21T00:00:00 Style 126
When the time portion of the datetime is zero it will have a smaller length.
Zero Time is very common in migrated dates from a database with dates only.
Please advise if you get the same results or if I have made a mistake
regards
Mark
Thank you, Mark! You are correct. I will fix the blog post. For style 126, the milliseconds portion of the date is left off when it is zero. This is very disappointing because in Books Online it says:
CAST and CONVERT (Transact-SQL)
Style 126
Standard ISO8601
Input/Output yyyy-mm-ddThh:mm:ss.mmm (no spaces)
This led me to believe that it would always display mmm. There is no notation about mmm being suppressed when milliseconds are 0.
It looks like style 121 (ODBC canonical with milliseconds) is now demonstrably superior to style 127, no thanks to Microsoft.
Please see this thread for an interesting discussion – looks like we may want to perform more tests.
I made some changes today based on this thread.
Very useful post. I only use SQL2000 and had come up with method 1 from my own first principles.
Method 2 was very ineresting and for my complex piece of work was more efficient.
I did however eventually use a method not listed here. I precalculated the value I needed and stored that max value on the parent table (couldn’t use a #temp). The execution time pre-calculation plus the actual query together was less than either method 1 or 2.
Perhaps not elegant or normalised but it was faster.
@Paul,
As long as the max value is unique per parent scope you are in good shape with that method. Saving values temporarily to a table is often a good way to increase performance as this uses a smaller resource footprint and guarantees the engine’s fast choice of correct execution plan for each individual step.
I would, however, caution against storing temporary Max values in parent tables. If the column is intended to be permanent, then your table is badly denormalized and the value can become out of synch. If the column is intended to be temporary, then one shouldn’t be mucking up the transaction log with spurious updates to a table that hasn’t actually changed. Instead, I would use a temp table, which will only affect tempdb and for any appreciable number of rows could be superior to a table variable as it will have statistics aiding in execution plan choice.
This is a great summay of different approaches. It would be really helpful if it could be supplemented by some example performance data.
Did you see this blog post
Including aggregated column’s related values – part 2
The reference is at the bottom of this blog post
Thanks for this post helped me loads!
I am using this as part of a script to grab unique email addresses attached to user accounts as part of migrating to a new website. The new site has different user account restraints ie unique emails. old site many user account could have the same email addresses so had to find the usernames with most recent activity. Thanks again!
[…] also Including an Aggregated Column’s Related Values for 5 different ways to do this kind of […]