Login or Sign Up to become a member!
LessThanDot Sit Logo

LessThanDot

Data Management

Less Than Dot is a community of passionate IT professionals and enthusiasts dedicated to sharing technical knowledge, experience, and assistance. Inside you will find reference materials, interesting technical discussions, and expert tips and commentary. Once you register for an account you will have immediate access to the forums and all past articles and commentaries.

LTD Social Sitings

Lessthandot twitter Lessthandot Linkedin Lessthandot friendfeed Lessthandot facebook Lessthandot rss

Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.

Your profile

Search

XML Feeds

Google Ads

« The differences between LEN and DATALENGTH in SQL ServerIsNumeric, IsInt, IsNumber »
comments
Rate Post:
submit to reddit Digg!FacebookDotnetkicks

In this co-authored blog post, Naomi and I will present six different solutions to the commonly experienced query problem of how to include an aggregated column's related values - values from the same row in the group that provided the displayed value.

Background

It's a fairly simple query to select the maximum date for each group in a GROUP BY query. You just throw in a Max() on the date column and then GROUP BY the other columns. For example,

  1. SELECT
  2.    CustomerID,
  3.    LastOrderDate = MAX(OrderDate)
  4. FROM Orders
  5. GROUP BY CustomerID

But a common query need is to include other column values from the same row as the most recent date. Unfortunately, putting aggregates on other columns doesn't work:

  1. SELECT
  2.    CustomerID,
  3.    LastOrderDate = MAX(OrderDate),
  4.    LastSubTotal = MAX(SubTotal) -- wrong: won't be the subtotal from Max(OrderDate), but the greatest order total ever placed by this customer.
  5. FROM Orders
  6. GROUP BY CustomerID

It's almost like you need some kind of Max(OrderDate->SubTotal).

Note that the desired results are fairly easy in MS Access using the aggregate functions Last and First:

  1. SELECT
  2.    CustomerID,
  3.    LastOrderDate = LAST(OrderDate),
  4.    SubTotal = LAST(SubTotal)
  5. FROM Orders
  6. GROUP BY CustomerID
  7. ORDER BY OrderDate

What happens in Access is that the Last() aggregate combines with the ORDER BY clause to select values from the row containing the most recent Order Date.

However, these functions are not available in SQL Server, likely because the query engine only does ordering at the very end, after GROUP BY and HAVING are processed. While this may seem like a defect, it is really a deliberate design trade-off, with compensations elsewhere in the engine.

Test Case

In order to have enough data to make results significant and also to let others run the same queries against the same data, we are going to use the AdventureWorks sample database.

In our scenario, we want to show customers' account numbers along with the date and subtotal of their most recent order. We're keeping it simple for clarity, but note that you can include as many columns as you like. In a later installment we'll do these same queries with more columns and different scenarios, and will expand on some of the queries a bit. We'll also discuss performance to help you select between them.

Note that the aggregate doesn't have to be on a date. Perhaps you want to know the date of the largest dollar value order per customer. In this case, you'd be doing a Max() on the order value instead of a date, and you would most certainly run into duplicates. Keep this in mind when reading below and when developing your own queries.

The basic query we'll be working with is:

  1. SELECT
  2.    C.AccountNumber,
  3.    LastOrderDate = MAX(O.OrderDate)
  4. FROM
  5.    Sales.Customer C
  6.    INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID

What we want to do is also return the Subtotal value from the same row as the LastOrderDate.

Some Approaches to the Problem

We know of only a few basic approaches to the problem:

- Key Search -- use the LastOrderDate column as a key (along with customer AccountNumber) to locate the correct rows in the OrderHeader table. This requires an extra join and also the key isn't unique, so we might have to perform another Max() on a unique column and join another time, with the new column added to the key.

- Number and Filter -- select all rows from OrderHeader, and number them in a way exploitable to correctly filter out unwanted rows. Observe that this won't require hitting the table again, but it could use a lot of extra resources to temporarily materialize more data. Skipping over obvious poor choices such as a temp table with an ordered update or a cursor, options might be a windowing function (SQL 2005 and up) or a temp table with an identity column (SQL 2000).

- Simulate MS Access -- simulate how MS Access works somehow, getting only the data we need in a single table access (seek or scan). Let's imagine how the SQL engine already performs a Max. As it touches each row, it must store the intermediate value that is the best candidate so far for all the seen values. And it has to have one of these per group, per column. All that is needed is that every time it writes the Max() candidate into its temporary storage, it also writes the extra column we want into the same data.

How could we do this ourselves? We need to ORDER BY OrderDate but actually yield SubTotal. SubTotal clearly won't sort in the correct order by itself. Remembering Max(OrderDate->SubTotal), what if it wasn't by itself, and somehow contained the OrderDate, too, to make it sort properly? So here we have an idea about packing both columns into a single value that sorts correctly yet can still be used to get the SubTotal back out. When the engine does its Max() bit as usual on the date and stores the best candidate, it will also be forced to store the other value at the same time (which we can extract later).

So let's use those ideas and turn them into real queries. Unless otherwise stated, queries will work with SQL Server 2000.

Key Search

1. Correlated Subquery

  • Relies on unique order dates to avoid selecting multiple rows per customer.
  • Correlated subqueries perform well with small datasets and badly with large datasets (though better with SQL 2005 and 2008 because they can sometimes get converted to real joins).
  • Performance degrades geometrically as rows increase.
  • Does not handle NULLs correctly.
  1. SELECT
  2.    C.AccountNumber,
  3.    O.OrderDate,
  4.    O.SubTotal
  5. FROM
  6.    Sales.Customer C
  7.    INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID
  8. WHERE
  9.    O.OrderDate = (
  10.       SELECT MAX(OrderDate)
  11.       FROM SalesOrderHeader O2
  12.       WHERE O2.CustomerID = O.CustomerID
  13.    )

Another version of correlated subquery which seems to perform much better is

  1. SELECT
  2.    C.AccountNumber,
  3.    O.OrderDate,
  4.    O.SubTotal
  5. FROM
  6.    Sales.Customer C
  7.    INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID
  8. WHERE
  9.    O.OrderID = (
  10.       SELECT TOP 1 OrderID
  11.       FROM SalesOrderHeader O2
  12.       WHERE O2.CustomerID = O.CustomerID ORDER BY O2.OrderDate DESC)

Note, that the first variation of this query may return duplicate rows, but the second will always return only one row per group (we can not predict which one in case of duplicates)

2. Derived Table

  • Relies on unique order dates to avoid selecting multiple rows per customer.
  • Performs better than the correlated subquery when querying a significant portion of all customers, but worse when only querying a few, since the last order date for every customer is always calculated.
  • Performance degrades linearly as rows increase.
  • Does not handle NULLs correctly.
  1. SELECT
  2.    C.AccountNumber,
  3.    O.OrderDate,
  4.    O.SubTotal
  5. FROM
  6.    Sales.Customer C
  7.    INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID
  8.    INNER JOIN (
  9.       SELECT CustomerID, LastOrderDate = MAX(OrderDate)
  10.       FROM Sales.SalesOrderHeader
  11.       GROUP BY CustomerID
  12.    ) O2 ON O.CustomerID = O2.CustomerID AND O.OrderDate = O2.LastOrderDate


Number and Filter

Windowing Functions

The windowing functions were introduced in SQL Server 2005, so the two queries below would not work in SQL Server 2000.

  • Will always return one row per customer. Using RANK() instead of ROW_NUMBER() will yield the same results as queries 1 and 2.
  • Easily supports not just 1 but n rows per group.
  • Performance degrades fairly linearly as rows increase. The windowing functions are fast but huge rowsets can begin to bog them down.

3. Windowing Function - Two Stages

  • A common table expression is not needed, but is cleaner. The CTE query can be made into a derived table if desired.
  1. WITH OrderData AS (
  2.    SELECT
  3.       C.AccountNumber,
  4.       O.OrderDate,
  5.       O.SubTotal,
  6.       Selector = ROW_NUMBER() OVER (PARTITION BY O.CustomerID ORDER BY O.OrderDate DESC)
  7.    FROM
  8.       Sales.Customer C
  9.       INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID
  10. )
  11. SELECT AccountNumber, OrderDate, SubTotal
  12. FROM OrderData
  13. WHERE Selector = 1

4. Windowing Function - One Stage

  1. SELECT TOP 1 WITH TIES
  2.    C.AccountNumber,
  3.    O.OrderDate,
  4.    O.SubTotal
  5. FROM
  6.    Sales.Customer C
  7.    INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID
  8. ORDER BY
  9.    ROW_NUMBER() OVER (PARTITION BY O.CustomerID ORDER BY OrderDate DESC)


Simulate MS Access

5. Compound Key, aka Packed Values

  • Uses only one table access with a simple GROUP BY.
  • Naomi first learned this idea from a Russian SQL FAQ.
  • Erik cooked up a similar idea in 2008.
  • This example uses a varchar column to hold the packed values, but be aware that converting to and from binary is faster than char (plus harder to use and more error-prone) and binary aggregation is faster. The crucial part is being able to faithfully extract the data that was put in.
  • This query has no problem with duplicates. It does effectively sort by the full compound value, so if multiple orders for the same customer can be on the same order date and you want some other column to determine which one is selected, it has to be included in the packed data.
  • As given, this query does not handle NULLs at all. A few strategic Coalesces can fix that.
  • There is no significant performance difference between unpacking one many-item compound value or many one-item compound values (though you'd always include all columns needed for ordering).
  1. SELECT
  2.    CustomerID,
  3.    LastOrderDate,
  4.    LastSubTotal = CONVERT(MONEY, SUBSTRING(DateAndSubtotal, 20, 25))
  5. FROM
  6.    (
  7.       SELECT
  8.          CustomerID,
  9.          LastOrderDate = MAX(OrderDate),
  10.          DateAndSubtotal = MAX(CONVERT(VARCHAR(50), OrderDate, 121) + CONVERT(VARCHAR(50), Subtotal))
  11.       FROM Sales.SalesOrderHeader
  12.       GROUP BY CustomerID
  13.    ) X

Isn't that cool? I don't advocate using it all the time, but when performance is bad and the trade-offs are worth it, do it.

In closing, thank you for visiting. We hope this is useful to you. You might like to see the next installment Including an Aggregated Column's Related Values - Part 2, discussing this issue in further depth.

Erik

P.S. I would like to thank Naomi for co-authoring this blog post with me. She did the heavy lifting of actually writing queries and testing their performance; I just fleshed out the explanations a bit.

About the Author

Erik has been working in IT for 15 years and since 2004 has specialized in MS SQL Server query writing, database design, and reporting services. He also professionally does web site design and writes C# applications (mostly for projects involving his databases and web sites). His career in the industry truly started when he first began using computers in the late 70s. In 1984 he began programming on the IBM PCJr, and performed game testing on King's Quest I. These early experiences gave him the taste and drive to continue an incremental and self-taught path all the way to his current position. Erik is also interested in taekwondo, go (ranks 4k on KGS), sci-fi books, mathematics, philosophy, religion, and rollerblading. He lives with his family on the West Coast of the US.
Social SitingsLTD RSS Feed
3304 views
submit to reddit Digg!FacebookDotnetkicks

Comments and Feedback

9 comments

Comment from: SQLDenis [Member] Email
*****
Excellent post and much needed. Thank you both.

I actually ask this question when I interview people and I can tell you that 95% of the people are NOT able to write any of these 5 versions. They can write a query that gives the max for one column for example but when you tell them that you want all the columns it becomes problematic

Would love to see the performance differences between the 5 queries
07/19/09 @ 09:00
Comment from: onpnt [Member] Email
*****
Agree, this is a really good post and one that deffinetely gets people. I had to work with my SQL Dev a few times getting over this hill.

Very nice Erik and Naomi!
07/20/09 @ 05:58
Comment from: thirster42 [Member] Email
*****
good read. still haven't quite wrapped my mind on how row_number works.
07/21/09 @ 07:25
Comment from: Moacyr Zalcman [Visitor]
*****
Excellent post, in Universal Thread I asked this matter and normally use the CTE approach, but is important to have things clarified as this post does.

Thanks
07/28/09 @ 06:48
Comment from: Mark [Visitor]
Great post.

For the Simulate MS Access compound key solution I found that using a style of 126 as in the following:

MAX(CONVERT(VARCHAR(50), OrderDate, 126)

produced inconsistent datetime lengths for the data I had.

I used the following:

MAX(CONVERT(VARCHAR(50), OrderDate, 20) and it worked ok for me.

I would like to thank you for a great post. One of the best I have seen on the internet.

cheers
Mark
11/10/09 @ 14:21
Comment from: Emtucifor [Member] Email
Mark,

Thanks for the compliment!

About the lengths, that's very interesting. I don't understand how style 126 would give inconsistent lengths, since each value is padded out to its max length. Could you give some examples of values that resulted in different lengths?

I can at least say for certain that using style 20 will not give correct results because it loses precision by truncating the milliseconds portion of the given datetime.

Style 126: 2009-11-10T15:03:01.003
Style 020: 2009-11-10T15:03:01
11/10/09 @ 16:02
Comment from: Mark [Visitor]
Hi Emtucifor

Please look at these dates

2009-07-27 12:22:44.537 datetime
2009-07-27T12:22:44.537 Style 126

2005-11-21 00:00:00.000 datetime
2005-11-21T00:00:00 Style 126


When the time portion of the datetime is zero it will have a smaller length.

Zero Time is very common in migrated dates from a database with dates only.

Please advise if you get the same results or if I have made a mistake

regards
Mark

11/11/09 @ 15:50
Comment from: Emtucifor [Member] Email
Thank you, Mark! You are correct. I will fix the blog post. For style 126, the milliseconds portion of the date is left off when it is zero. This is very disappointing because in Books Online it says:

CAST and CONVERT (Transact-SQL)
Style 126
Standard ISO8601
Input/Output yyyy-mm-ddThh:mm:ss.mmm (no spaces)

This led me to believe that it would always display mmm. There is no notation about mmm being suppressed when milliseconds are 0.

It looks like style 121 (ODBC canonical with milliseconds) is now demonstrably superior to style 127, no thanks to Microsoft.
11/11/09 @ 16:43
Comment from: Naomi [Member] Email
Please see this thread http://social.msdn.microsoft.com/Forums/en-US/transactsql/thread/36fa158c-184a-45fa-8875-9a4ffff33c3b for an interesting discussion - looks like we may want to perform more tests.
03/10/10 @ 08:53

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)