Login or Sign Up to become a member!
LessThanDot Sit Logo

LessThanDot

Data Management

Less Than Dot is a community of passionate IT professionals and enthusiasts dedicated to sharing technical knowledge, experience, and assistance. Inside you will find reference materials, interesting technical discussions, and expert tips and commentary. Once you register for an account you will have immediate access to the forums and all past articles and commentaries.

LTD Social Sitings

Lessthandot twitter Lessthandot Linkedin Lessthandot friendfeed Lessthandot facebook Lessthandot rss

Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.

Your profile

    Search

    XML Feeds

    Google Ads

    « The differences between LEN and DATALENGTH in SQL ServerIsNumeric, IsInt, IsNumber »
    comments

    In this co-authored blog post, Naomi and I will present six different solutions to the commonly experienced query problem of how to include an aggregated column's related values - values from the same row in the group that provided the displayed value.

    Background

    It's a fairly simple query to select the maximum date for each group in a GROUP BY query. You just throw in a Max() on the date column and then GROUP BY the other columns. For example,

    1. SELECT
    2.    CustomerID,
    3.    LastOrderDate = Max(OrderDate)
    4. FROM Orders
    5. GROUP BY CustomerID

    But a common query need is to include other column values from the same row as the most recent date. Unfortunately, putting aggregates on other columns doesn't work:

    1. SELECT
    2.    CustomerID,
    3.    LastOrderDate = Max(OrderDate),
    4.    LastSubTotal = Max(SubTotal) -- wrong: won't be the subtotal from Max(OrderDate), but the greatest order total ever placed by this customer.
    5. FROM Orders
    6. GROUP BY CustomerID

    It's almost like you need some kind of Max(OrderDate->SubTotal).

    Note that the desired results are fairly easy in MS Access using the aggregate functions Last and First:

    1. SELECT
    2.    CustomerID,
    3.    LastOrderDate = Last(OrderDate),
    4.    SubTotal = Last(SubTotal)
    5. FROM Orders
    6. GROUP BY CustomerID
    7. ORDER BY OrderDate

    What happens in Access is that the Last() aggregate combines with the ORDER BY clause to select values from the row containing the most recent Order Date.

    However, these functions are not available in SQL Server, likely because the query engine only does ordering at the very end, after GROUP BY and HAVING are processed. While this may seem like a defect, it is really a deliberate design trade-off, with compensations elsewhere in the engine.

    Test Case

    In order to have enough data to make results significant and also to let others run the same queries against the same data, we are going to use the AdventureWorks sample database.

    In our scenario, we want to show customers' account numbers along with the date and subtotal of their most recent order. We're keeping it simple for clarity, but note that you can include as many columns as you like. In a later installment we'll do these same queries with more columns and different scenarios, and will expand on some of the queries a bit. We'll also discuss performance to help you select between them.

    Note that the aggregate doesn't have to be on a date. Perhaps you want to know the date of the largest dollar value order per customer. In this case, you'd be doing a Max() on the order value instead of a date, and you would most certainly run into duplicates. Keep this in mind when reading below and when developing your own queries.

    The basic query we'll be working with is:

    1. SELECT
    2.    C.AccountNumber,
    3.    LastOrderDate = Max(O.OrderDate)
    4. FROM
    5.    Sales.Customer C
    6.    INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID

    What we want to do is also return the Subtotal value from the same row as the LastOrderDate.

    Some Approaches to the Problem

    We know of only a few basic approaches to the problem:

    - Key Search -- use the LastOrderDate column as a key (along with customer AccountNumber) to locate the correct rows in the OrderHeader table. This requires an extra join and also the key isn't unique, so we might have to perform another Max() on a unique column and join another time, with the new column added to the key.

    - Number and Filter -- select all rows from OrderHeader, and number them in a way exploitable to correctly filter out unwanted rows. Observe that this won't require hitting the table again, but it could use a lot of extra resources to temporarily materialize more data. Skipping over obvious poor choices such as a temp table with an ordered update or a cursor, options might be a windowing function (SQL 2005 and up) or a temp table with an identity column (SQL 2000).

    - Simulate MS Access -- simulate how MS Access works somehow, getting only the data we need in a single table access (seek or scan). Let's imagine how the SQL engine already performs a Max. As it touches each row, it must store the intermediate value that is the best candidate so far for all the seen values. And it has to have one of these per group, per column. All that is needed is that every time it writes the Max() candidate into its temporary storage, it also writes the extra column we want into the same data.

    How could we do this ourselves? We need to ORDER BY OrderDate but actually yield SubTotal. SubTotal clearly won't sort in the correct order by itself. Remembering Max(OrderDate->SubTotal), what if it wasn't by itself, and somehow contained the OrderDate, too, to make it sort properly? So here we have an idea about packing both columns into a single value that sorts correctly yet can still be used to get the SubTotal back out. When the engine does its Max() bit as usual on the date and stores the best candidate, it will also be forced to store the other value at the same time (which we can extract later).

    So let's use those ideas and turn them into real queries. Unless otherwise stated, queries will work with SQL Server 2000.

    Key Search

    1. Correlated Subquery

    • Relies on unique order dates to avoid selecting multiple rows per customer.
    • Correlated subqueries perform well with small datasets and badly with large datasets (though better with SQL 2005 and 2008 because they can sometimes get converted to real joins).
    • Performance degrades geometrically as rows increase.
    • Does not handle NULLs correctly.
    1. SELECT
    2.    C.AccountNumber,
    3.    O.OrderDate,
    4.    O.SubTotal
    5. FROM
    6.    Sales.Customer C
    7.    INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID
    8. WHERE
    9.    O.OrderDate = (
    10.       SELECT Max(OrderDate)
    11.       FROM SalesOrderHeader O2
    12.       WHERE O2.CustomerID = O.CustomerID
    13.    )

    Another version of correlated subquery which sometimes performs much better is

    1. SELECT
    2.    C.AccountNumber,
    3.    O.OrderDate,
    4.    O.SubTotal
    5. FROM
    6.    Sales.Customer C
    7.    INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID
    8. WHERE
    9.    O.OrderID = (
    10.       SELECT top 1 OrderID
    11.       FROM SalesOrderHeader O2
    12.       WHERE O2.CustomerID = O.CustomerID order by O2.OrderDate Desc, O2.OrderID DESC)

    Another variation of this query is (suggested by Alejandro Mesa (Hunchback) in this MSDN thread):

    1. SELECT
    2.    C.AccountNumber,
    3.    O1.OrderDate,
    4.    O1.SubTotal
    5. FROM
    6.    Sales.Customer C
    7.    INNER JOIN Sales.SalesOrderHeader O1 ON C.CustomerID = O1.CustomerID
    8. WHERE not exists (select 1 from Sales.SalesOrderHeader O2 where O2.CustomerID = O1.CustomerID and O1.OrderDate < O2.OrderDate)

    Note, that the first variation of this query may return duplicate rows, but the second will always return only one row per group (with the latest OrderID).

    The third variation of this query with NOT EXISTS will produce duplicates in case of the same date, but can also be easily adjusted to return only one row with the max OrderID:

    1. SELECT
    2.    C.AccountNumber,
    3.    O1.OrderDate,
    4.    O1.SubTotal
    5. FROM
    6.    Sales.Customer C
    7.    INNER JOIN Sales.SalesOrderHeader O1 ON C.CustomerID = O1.CustomerID
    8. WHERE not exists (select 1 from Sales.SalesOrderHeader O2 where O2.CustomerID = O1.CustomerID and (O1.OrderDate < O2.OrderDate OR
    9. (O1.OrderDate = O2.OrderDate and O1.OrderID < O2.OrderID)))

    2. Derived Table

    • Relies on unique order dates to avoid selecting multiple rows per customer.
    • Performs better than the correlated subquery when querying a significant portion of all customers, but worse when only querying a few, since the last order date for every customer is always calculated.
    • Performance degrades linearly as rows increase.
    • Does not handle NULLs correctly.
    1. SELECT
    2.    C.AccountNumber,
    3.    O.OrderDate,
    4.    O.SubTotal
    5. FROM
    6.    Sales.Customer C
    7.    INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID
    8.    INNER JOIN (
    9.       SELECT CustomerID, LastOrderDate = Max(OrderDate)
    10.       FROM Sales.SalesOrderHeader
    11.       GROUP BY CustomerID
    12.    ) O2 ON O.CustomerID = O2.CustomerID AND O.OrderDate = O2.LastOrderDate


    Number and Filter

    Windowing Functions

    The windowing functions were introduced in SQL Server 2005, so the two queries below will not work in SQL Server 2000.

    • Will always return one row per customer. Using RANK() instead of ROW_NUMBER() will yield the same results as queries 1 and 2.
    • Easily supports not just 1 but n rows per group.
    • Performance degrades fairly linearly as rows increase. The windowing functions are fast but huge rowsets can begin to bog them down.

    3. Windowing Function - Two Stages

    • A common table expression is not needed, but is cleaner. The CTE query can be made into a derived table if desired.
    1. WITH OrderData AS (
    2.    SELECT
    3.       C.AccountNumber,
    4.       O.OrderDate,
    5.       O.SubTotal,
    6.       Selector = ROW_NUMBER() OVER (PARTITION BY O.CustomerID ORDER BY O.OrderDate DESC)
    7.    FROM
    8.       Sales.Customer C
    9.       INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID
    10. )
    11. SELECT AccountNumber, OrderDate, SubTotal
    12. FROM OrderData
    13. WHERE Selector = 1

    4. Windowing Function - One Stage

    1. SELECT TOP 1 WITH TIES
    2.    C.AccountNumber,
    3.    O.OrderDate,
    4.    O.SubTotal
    5. FROM
    6.    Sales.Customer C
    7.    INNER JOIN Sales.SalesOrderHeader O ON C.CustomerID = O.CustomerID
    8. ORDER BY
    9.    ROW_NUMBER() OVER (PARTITION by O.CustomerID ORDER BY OrderDate DESC)


    Simulate MS Access

    5. Compound Key, aka Packed Values

    • Uses only one table access with a simple GROUP BY.
    • Naomi first learned this idea from a Russian SQL FAQ.
    • Erik cooked up a similar idea in 2008.
    • This example uses a varchar column to hold the packed values, but be aware that converting to and from binary is faster than char (plus harder to use and more error-prone) and binary aggregation is faster. The crucial part is being able to faithfully extract the data that was put in.
    • This query has no problem with duplicates. It does effectively sort by the full compound value, so if multiple orders for the same customer can be on the same order date and you want some other column to determine which one is selected, it has to be included in the packed data.
    • As given, this query does not handle NULLs at all. A few strategic Coalesces can fix that.
    • There is no significant performance difference between unpacking one many-item compound value or many one-item compound values (though you'd always include all columns needed for ordering).
    1. SELECT
    2.    CustomerID,
    3.    LastOrderDate,
    4.    LastSubTotal = Convert(money, Substring(DateAndSubtotal, 20, 25))
    5. FROM
    6.    (
    7.       SELECT
    8.          CustomerID,
    9.          LastOrderDate = Max(OrderDate),
    10.          DateAndSubtotal = Max(convert(varchar(50), OrderDate, 121) + Convert(varchar(50), Subtotal))
    11.       FROM Sales.SalesOrderHeader
    12.       GROUP BY CustomerID
    13.    ) X

    Isn't that cool? I don't advocate using it all the time, but when performance is bad and the trade-offs are worth it, do it.

    In closing, thank you for visiting. We hope this is useful to you. You might like to see the next installment Including an Aggregated Column's Related Values - Part 2, discussing this issue in further depth.

    Erik

    P.S. I would like to thank Naomi for co-authoring this blog post with me. She did the heavy lifting of actually writing queries and testing their performance; I just fleshed out the explanations a bit.

    About the Author

    Erik has been working in IT since 1993 and starting in 2004 has specialized in MS SQL Server query writing, database design, and reporting services. He also professionally does web site design/developing and writes C# applications (mostly for projects involving his databases and web sites). His career in the industry truly started when he first began using computers in the late 70s. In 1984 he began programming on the IBM PCJr, and performed game testing on King's Quest I. These early experiences gave him the taste and drive to continue an incremental and self-taught path all the way to his current position. Erik is also interested in taekwondo, go (ranks 4k on KGS), sci-fi books, mathematics, philosophy, religion, and rollerblading. He and his family live on the West Coast of the US.
    Social SitingsLTD RSS Feed
    10951 views
    InstapaperVote on HN

    15 comments

    Comment from: SQLDenis [Member] Email
    *****
    SQLDenis Excellent post and much needed. Thank you both.

    I actually ask this question when I interview people and I can tell you that 95% of the people are NOT able to write any of these 5 versions. They can write a query that gives the max for one column for example but when you tell them that you want all the columns it becomes problematic

    Would love to see the performance differences between the 5 queries
    07/19/09 @ 09:00
    Comment from: Ted Krueger (onpnt) [Member]
    *****
    Ted Krueger (onpnt) Agree, this is a really good post and one that deffinetely gets people. I had to work with my SQL Dev a few times getting over this hill.

    Very nice Erik and Naomi!
    07/20/09 @ 05:58
    Comment from: David Forck (thirster42) [Member]
    *****
    good read. still haven't quite wrapped my mind on how row_number works.
    07/21/09 @ 07:25
    Comment from: Moacyr Zalcman [Visitor]
    *****
    Moacyr Zalcman Excellent post, in Universal Thread I asked this matter and normally use the CTE approach, but is important to have things clarified as this post does.

    Thanks
    07/28/09 @ 06:48
    Comment from: Mark [Visitor]
    Mark Great post.

    For the Simulate MS Access compound key solution I found that using a style of 126 as in the following:

    MAX(CONVERT(VARCHAR(50), OrderDate, 126)

    produced inconsistent datetime lengths for the data I had.

    I used the following:

    MAX(CONVERT(VARCHAR(50), OrderDate, 20) and it worked ok for me.

    I would like to thank you for a great post. One of the best I have seen on the internet.

    cheers
    Mark
    11/10/09 @ 14:21
    Comment from: Erik [Member] Email
    Erik Mark,

    Thanks for the compliment!

    About the lengths, that's very interesting. I don't understand how style 126 would give inconsistent lengths, since each value is padded out to its max length. Could you give some examples of values that resulted in different lengths?

    I can at least say for certain that using style 20 will not give correct results because it loses precision by truncating the milliseconds portion of the given datetime.

    Style 126: 2009-11-10T15:03:01.003
    Style 020: 2009-11-10T15:03:01
    11/10/09 @ 16:02
    Comment from: Mark [Visitor]
    Mark Hi Emtucifor

    Please look at these dates

    2009-07-27 12:22:44.537 datetime
    2009-07-27T12:22:44.537 Style 126

    2005-11-21 00:00:00.000 datetime
    2005-11-21T00:00:00 Style 126


    When the time portion of the datetime is zero it will have a smaller length.

    Zero Time is very common in migrated dates from a database with dates only.

    Please advise if you get the same results or if I have made a mistake

    regards
    Mark

    11/11/09 @ 15:50
    Comment from: Erik [Member] Email
    Erik Thank you, Mark! You are correct. I will fix the blog post. For style 126, the milliseconds portion of the date is left off when it is zero. This is very disappointing because in Books Online it says:

    CAST and CONVERT (Transact-SQL)
    Style 126
    Standard ISO8601
    Input/Output yyyy-mm-ddThh:mm:ss.mmm (no spaces)

    This led me to believe that it would always display mmm. There is no notation about mmm being suppressed when milliseconds are 0.

    It looks like style 121 (ODBC canonical with milliseconds) is now demonstrably superior to style 127, no thanks to Microsoft.
    11/11/09 @ 16:43
    Comment from: Naomi Nosonovsky [Member]
    Naomi Nosonovsky Please see this thread for an interesting discussion - looks like we may want to perform more tests.
    03/10/10 @ 08:53
    Comment from: Naomi Nosonovsky [Member]
    Naomi Nosonovsky I made some changes today based on this thread.
    04/29/10 @ 22:09
    Comment from: Paul [Visitor] Email
    Paul Very useful post. I only use SQL2000 and had come up with method 1 from my own first principles.

    Method 2 was very ineresting and for my complex piece of work was more efficient.

    I did however eventually use a method not listed here. I precalculated the value I needed and stored that max value on the parent table (couldn't use a #temp). The execution time pre-calculation plus the actual query together was less than either method 1 or 2.

    Perhaps not elegant or normalised but it was faster.
    11/02/11 @ 06:20
    Comment from: Erik [Member] Email
    Erik @Paul,

    As long as the max value is unique per parent scope you are in good shape with that method. Saving values temporarily to a table is often a good way to increase performance as this uses a smaller resource footprint and guarantees the engine's fast choice of correct execution plan for each individual step.

    I would, however, caution against storing temporary Max values in parent tables. If the column is intended to be permanent, then your table is badly denormalized and the value can become out of synch. If the column is intended to be temporary, then one shouldn't be mucking up the transaction log with spurious updates to a table that hasn't actually changed. Instead, I would use a temp table, which will only affect tempdb and for any appreciable number of rows could be superior to a table variable as it will have statistics aiding in execution plan choice.
    11/02/11 @ 10:39
    Comment from: bob [Visitor]
    bob This is a great summay of different approaches. It would be really helpful if it could be supplemented by some example performance data.
    02/24/12 @ 11:02
    Comment from: Naomi Nosonovsky [Member]
    Naomi Nosonovsky Did you see this blog post
    Including aggregated column's related values - part 2

    The reference is at the bottom of this blog post
    02/24/12 @ 14:32
    Comment from: Samuel Lilleker [Visitor] · http://solar-roof-tiles.co.uk/
    Samuel Lilleker Thanks for this post helped me loads!

    I am using this as part of a script to grab unique email addresses attached to user accounts as part of migrating to a new website. The new site has different user account restraints ie unique emails. old site many user account could have the same email addresses so had to find the usernames with most recent activity. Thanks again!
    10/02/12 @ 07:47

    Leave a comment


    Your email address will not be revealed on this site.

    To mislead the spambots.

    Your URL will be displayed.
    (Line breaks become <br />)
    (Name, email & website)
    (Allow users to contact you through a message form (your email will not be revealed.)