Login or Sign Up to become a member!
LessThanDot Sit Logo

LessThanDot

Data Management

Less Than Dot is a community of passionate IT professionals and enthusiasts dedicated to sharing technical knowledge, experience, and assistance. Inside you will find reference materials, interesting technical discussions, and expert tips and commentary. Once you register for an account you will have immediate access to the forums and all past articles and commentaries.

LTD Social Sitings

Lessthandot twitter Lessthandot Linkedin Lessthandot friendfeed Lessthandot facebook Lessthandot rss

Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.

Your profile

    Search

    XML Feeds

    Google Ads

    « 2011 Quarter 3 Goals: Kicking [beep], Taking NamesReturning a value inserted in a table with a newsequentialid() default on a uniqueidentifier column »
    comments

    In this blog post I would like to touch a very common problem which often trips up newbies and even serious SQL Professionals. I wanted to write this blog post for quite some time, but always put it aside. However, just a few days ago I have been dealing with this problem at my work place, so I believe it's time to discuss the problem.

    For the discussion let's consider AdventureWorks database SalesOrderHeader and SalesOrderDetail tables. To illustrate the problem, let's assume that returns are handled in a different table with a structure very similar to SalesOrderDetail. And let's assume there are about 20 percent of returns (the business must not be doing too well!).

    Ok, to create the returns table in the Sales schema let's use the following code:

    1. SELECT IDENTITY(INT) AS ReturnID, D.LineTotal AS ReturnTotal, D.ProductID, D.OrderQty, D.UnitPrice, D.SalesOrderID
    2. INTO Sales.SalesReturns
    3. FROM Sales.SalesOrderDetail D TABLESAMPLE (20 PERCENT)

    Now, suppose we need to get quantity ordered and quantity returned as well as line total for both quantity and returned grouped by Customer ID. Let's write the most intuitive query:

    1. SELECT OH.CustomerID,
    2.        COALESCE(SUM(OD.OrderQty),0) AS [Ordered quantity],
    3.        COALESCE(SUM(OD.LineTotal),0) AS [Ordered],
    4.        COALESCE(SUM(R.OrderQty),0) AS [Returned quantity],      
    5.        COALESCE(SUM(R.ReturnTotal),0) AS [Returned]
    6.  
    7. FROM Sales.SalesOrderHeader OH LEFT JOIN Sales.SalesOrderDetail OD ON OH.SalesOrderID = OD.SalesOrderID
    8. LEFT JOIN Sales.SalesReturns R ON OH.SalesOrderID = R.SalesOrderID
    9. GROUP BY OH.CustomerID
    10. ORDER BY OH.CustomerID

    Which gives us these results

    Aggregate Query Results - Ordered and Returned
    CustomerIDOrdered quantityOrderedReturned quantityReturned
    1882724610.142300857713619.368600
    219825199.49939200.000000
    3201404084947.217763189163814775.599788
    482684493242.54479275304036924.558800
    51190325894.561736973261739.811536

    Now, let's verify these results by only running Ordered amounts.

    Let's run this query (I re-wrote using CTE for simplicity):

    1. ;with cte as (SELECT OH.CustomerID,
    2.        COALESCE(SUM(OD.OrderQty),0) AS [Ordered quantity],
    3.        COALESCE(SUM(OD.LineTotal),0) AS [Ordered]
    4.  
    5. FROM Sales.SalesOrderHeader OH LEFT JOIN Sales.SalesOrderDetail OD ON OH.SalesOrderID = OD.SalesOrderID
    6. GROUP BY OH.CustomerID
    7. )
    8. SELECT * FROM cte

    which produces the following results:

    Aggregate Query Results - Ordered only
    CustomerIDOrdered quantityOrdered
    112185177.081200
    219825199.499392
    31695361999.058170
    4980586524.947384
    529986177.847024

    We see that numbers differ dramatically. So, what did happen in our first query? Why did it produce such high numbers that have nothing to do with the real numbers?

    The reason is that for each row of Ordered results we get N Returned rows and so the total number of rows per one customer became M*N, where M is number of Ordered rows and N number of Returned rows. Therefore our results get much higher numbers.

    So, the rule I always employ when I need to do aggregates that will involve more than one table to JOIN is to produce the necessary results separately as derived tables or CTE and only then join them together to form final result.

    Here is how our query works when I re-write it using this idea:

    1. ;WITH Ordered AS (SELECT OH.CustomerID,
    2.        COALESCE(SUM(OD.OrderQty),0) AS [Ordered Quantity],
    3.        COALESCE(SUM(OD.LineTotal),0) AS [Ordered]
    4.  
    5. FROM Sales.SalesOrderHeader OH LEFT JOIN Sales.SalesOrderDetail OD ON OH.SalesOrderID = OD.SalesOrderID
    6. Group BY OH.CustomerID
    7. ),
    8.  
    9. Returned AS (SELECT OH.CustomerID,
    10.        COALESCE(SUM(R.OrderQty),0) AS [Returned Quantity],
    11.        COALESCE(SUM(R.ReturnTotal),0) AS [Returned]
    12.  
    13. FROM Sales.SalesOrderHeader OH LEFT JOIN Sales.SalesReturns R ON OH.SalesOrderID = R.SalesOrderID
    14. Group BY OH.CustomerID
    15. )
    16.  
    17. SELECT COALESCE(O.CustomerID, R.CustomerID) AS CustomerID,
    18. COALESCE(O.[Ordered Quantity],0) AS [Ordered Quantity],
    19. COALESCE(O.[Ordered],0) AS [Ordered],
    20. COALESCE(R.[Returned Quantity],0) AS [Returned Quantity],
    21. COALESCE(R.[Returned],0) AS [Returned]
    22.  
    23. FROM Ordered O FULL JOIN Returned R ON O.CustomerID = R.CustomerID
    24. ORDER BY COALESCE(O.CustomerID, R.CustomerID)

    with these correct now results:

    Aggregate Query Results - Ordered and Returned
    CustomerIDOrdered QuantityOrderedReturned QuantityReturned
    112185177.0812004437199.802700
    219825199.49939200.000000
    31695361999.05817047191827.440195
    4980586524.947384242130206.961392
    529986177.8470248222023.096824

    I used the FULL JOIN just in case we may have Ordered items and no returns or vs. versa for the customer.

    I received a question of how to write this code for SQL 2000. In SQL 2000 we can not use CTE (common table expressions), but we can use derived tables, so the code above will be re-written in this manner:

    1. SELECT COALESCE(O.CustomerID, R.CustomerID) AS CustomerID,
    2. COALESCE(O.[Ordered Quantity],0) AS [Ordered Quantity],
    3. COALESCE(O.[Ordered],0) AS [Ordered],
    4. COALESCE(R.[Returned Quantity],0) AS [Returned Quantity],
    5. COALESCE(R.[Returned],0) AS [Returned]
    6.  
    7. FROM (SELECT OH.CustomerID,
    8.        COALESCE(SUM(OD.OrderQty),0) AS [Ordered Quantity],
    9.        COALESCE(SUM(OD.LineTotal),0) AS [Ordered]
    10.  
    11. FROM Sales.SalesOrderHeader OH LEFT JOIN Sales.SalesOrderDetail OD ON OH.SalesOrderID = OD.SalesOrderID
    12. GROUP BY OH.CustomerID
    13. )O FULL JOIN (SELECT OH.CustomerID,
    14.        COALESCE(SUM(R.OrderQty),0) AS [Returned Quantity],
    15.        COALESCE(SUM(R.ReturnTotal),0) AS [Returned]
    16.  
    17. FROM Sales.SalesOrderHeader OH LEFT JOIN Sales.SalesReturns R ON OH.SalesOrderID = R.SalesOrderID
    18. GROUP BY OH.CustomerID
    19. ) R ON O.CustomerID = R.CustomerID
    20. ORDER BY COALESCE(O.CustomerID, R.CustomerID)

    So, hopefully you can now easily recognize the problem and apply the same idea when dealing with a similar situation.

    Say, few days ago I looked into report. The number of joins used in the query made my head dizzy - I told right away that because of the JOINs the total results are not going to be correct.

    So, how did I solve the problem? I included the totals using WINDOW AGGREGATE function in the CTE first and then added the rest of the JOINs to get the data needed for report.

    Unfortunately, I am not sure if an easier method exists for reports in SSRS when we need to present information for customer orders, for example, but also display customer addresses (and there may be many addresses per customer). Obviously, we will not get correct totals if we will just sum the records - we need to apply some tricks, simplest of them will be to pre-calculate totals and include these fields as part of the returned dataset. I will be interested to know your solutions for this problem.

    I should also mention another way to solve this query (assuming we also want to return the Customer name) by using OUTER APPLY query.

    Here I am showing two different approaches for comparison. The first approach performs better.

    1. ;WITH Ordered AS (SELECT OH.CustomerID,
    2.        COALESCE(SUM(OD.OrderQty),0) AS [Ordered Quantity],
    3.        COALESCE(SUM(OD.LineTotal),0) AS [Ordered]
    4.  
    5. FROM Sales.SalesOrderHeader OH LEFT JOIN Sales.SalesOrderDetail OD ON OH.SalesOrderID = OD.SalesOrderID
    6. Group BY OH.CustomerID
    7. ),
    8.  
    9. Returned AS (SELECT OH.CustomerID,
    10.        COALESCE(SUM(R.OrderQty),0) AS [Returned Quantity],
    11.        COALESCE(SUM(R.ReturnTotal),0) AS [Returned]
    12.  
    13. FROM Sales.SalesOrderHeader OH LEFT JOIN Sales.SalesReturns R ON OH.SalesOrderID = R.SalesOrderID
    14. Group BY OH.CustomerID
    15. )
    16.  
    17. SELECT TOP (10) C.FirstName, C.LastName,
    18. COALESCE(O.[Ordered Quantity],0) AS [Ordered Quantity],
    19. COALESCE(O.[Ordered],0) AS [Ordered],
    20. COALESCE(R.[Returned Quantity],0) AS [Returned Quantity],
    21. COALESCE(R.[Returned],0) AS [Returned]
    22. --INTO dbo.QueryResults
    23. FROM Sales.vIndividualCustomer C
    24. LEFT JOIN Ordered O On C.CustomerID = O.CustomerID
    25. LEFT JOIN Returned R ON C.CustomerID = R.CustomerID
    26. ORDER BY C.CustomerID
    27.  
    28.  
    29. SELECT TOP (10) C.CustomerID, C.FirstName, C.LastName,  
    30. COALESCE(O.[Ordered Quantity],0) AS [Ordered Quantity],
    31. COALESCE(O.[Ordered],0) AS [Ordered],
    32. COALESCE(R.[Returned Quantity],0) AS [Returned Quantity],
    33. COALESCE(R.[Returned],0) AS [Returned]
    34. FROM Sales.vIndividualCustomer C
    35. OUTER APPLY (SELECT SUM(OD.OrderQty) AS [Ordered Quantity],
    36. SUM(OD.LineTotal) AS [Ordered]
    37. FROM Sales.SalesOrderHeader OH
    38. LEFT JOIN Sales.SalesOrderDetail OD ON OH.SalesOrderID = OD.SalesOrderID
    39. WHERE OH.CustomerID = C.CustomerID) O
    40.  
    41. OUTER APPLY (SELECT
    42.        COALESCE(SUM(R.OrderQty),0) AS [Returned Quantity],
    43.        COALESCE(SUM(R.ReturnTotal),0) AS [Returned]
    44.  
    45. FROM Sales.SalesOrderHeader OH
    46. LEFT JOIN Sales.SalesReturns R ON OH.SalesOrderID = R.SalesOrderID
    47. WHERE OH.CustomerID = C.CustomerID) R
    48. ORDER BY C.CustomerID

    *** Remember, if you have a SQL related question, try our Microsoft SQL Server Programming forum or our Microsoft SQL Server Admin forum

    About the Author

    Naomi Nosonovsky, Senior Software Developer, has more than 15 years of enterprise experience in analysis, design, development and implementation of high-performance client/server data management solutions. She is a Personality of the Year for 2008, 2009, 2010 and 2011 at www.universalthread.com in .NET, SQL Server & VFP categories. She is also an All Star contributor/MCC Community Contributor at forums.asp.net and MSDN T-SQL forum.
    Social SitingsFacebookLinkedIndeliciousLTD RSS Feed
    1451 views
    Instapaper

    8 comments

    Comment from: TJ [Visitor]
    TJ Hi,

    Just wanted to say this helped me a lot.

    Thanks!
    09/27/11 @ 23:42
    Shimmy Awesome!!!
    10/06/11 @ 00:54
    Comment from: Janos Berke [Visitor] Email · http://iamberke.com
    Janos Berke Hi,

    I think ProductID is required logically at the SalesReturs left join: Sales.SalesReturns R ON OH.SalesOrderID = R.SalesOrderID AND OD.ProductID = R.ProductID

    I got correct result with this modified join ;)

    Anyway, I'm following your blog for a while and really like your posts.
    Janos
    10/16/11 @ 10:03
    Comment from: Naomi Nosonovsky [Member]
    Naomi Nosonovsky I got the correct results with this modification also, but this is a bit of cheating. It's only because we created our tables this way that returns matched order details by ProductID. If we would try to add dates condition, for example, we would not be able to match this way.
    10/16/11 @ 10:27
    Comment from: Janos Berke [Visitor] Email · http://iamberke.com
    Janos Berke I agree, it's is a kind of cheating. What I wanted to highlight, sometimes bad implementation or undefined business rules can cause wrong results as well. Your sample is very common in a system I'm working with right now and generates lot of (re)work for me ;)
    10/16/11 @ 12:07
    Comment from: Kevin S. Goff [Visitor] · http://www.kevinsgoff.net
    Kevin S. Goff Interesting...and I seem to recall seeing this discussion elsewhere. :)
    10/18/11 @ 19:43
    Comment from: Naomi Nosonovsky [Member]
    Naomi Nosonovsky I don't remember exactly.
    10/18/11 @ 21:19
    Comment from: lmu92 [Visitor]
    lmu92 I tend to disagree on your last statement "... The first approach performs better. ..." (indicating the OUTER APPLY solution would be slower).

    After creating the missing index
    [Sales].[SalesReturns] ([SalesOrderID])
    INCLUDE ([ReturnTotal],[OrderQty])

    I got the following results (based on SET STATISTICS TIME, IO ON):
    5ms for the OUTER APPLY and 117ms for CTE solution.

    I agree, the costs shown in the execution plan seem to be higher, but the statistics values indicate different.
    02/19/12 @ 10:49

    Leave a comment


    Your email address will not be revealed on this site.

    Your URL will be displayed.
    (Line breaks become <br />)
    (Name, email & website)
    (Allow users to contact you through a message form (your email will not be revealed.)