Login or Sign Up to become a member!
LessThanDot Sit Logo

LessThanDot

Data Management

Less Than Dot is a community of passionate IT professionals and enthusiasts dedicated to sharing technical knowledge, experience, and assistance. Inside you will find reference materials, interesting technical discussions, and expert tips and commentary. Once you register for an account you will have immediate access to the forums and all past articles and commentaries.

LTD Social Sitings

Lessthandot twitter Lessthandot Linkedin Lessthandot friendfeed Lessthandot facebook Lessthandot rss

Note: Watch for social icons on posts by your favorite authors to follow their postings on these and other social sites.

Your profile

    Search

    XML Feeds

    Google Ads

    « MERGE bug when Foreign Key constraint is created on non-clustered index (2008RTM)SQL Server and xp_cmdshell - The good, the bad and more ugly »
    comments

    Average Joe. An average day. An average salary. An average run. An average house.

    You hear about "averages" a lot. But what does it really mean? And why is this important to a SQL Server professional?

    Being Average

    An average is the "middle" value in a set of data values. When talking about or calculating an average, most people consider using the "mean". However, there are also "median" and "mode" to take into consideration. Each is calculated differently, and can produce drastically different results across a data set. As a data professional, particularly when working with business intelligence and reporting, knowing what data point the business or user needs to know is of utmost importance.

    Let's use the AdventureWorks2008R2 sample database to explore these functions. We'll use the SalesOrderHeader and SalesOrderDetail tables to determine "averages".

    To begin, create a view that takes the orders, by salesperson, for a year, and sums the number of orders and value of orders.

    1. CREATE VIEW Sales.OrdersBySalesperson
    2. AS
    3. SELECT SOH.SalesOrderID, SOH.OrderDate, SOH.SalesPersonID, SOH.CustomerID, SUM(SOD.OrderQty) AS Qty, SUM(SOD.LineTotal) AS Value
    4. FROM Sales.SalesOrderHeader AS SOH
    5.     INNER JOIN Sales.SalesOrderDetail AS SOD ON SOD.SalesOrderID = SOH.SalesOrderID
    6. WHERE SOH.OrderDate >= '20080101'
    7.     AND SOH.OrderDate < '20090101'
    8.     AND NOT(SalesPersonID IS NULL)
    9. GROUP BY SOH.SalesOrderID, SOH.OrderDate, SOH.SalesPersonID, SOH.CustomerID;

    Mean

    The mean is calculated by adding all the values in a data set, then dividing by the number of values in the set.

    In SQL Server, this can easily be achieved by using the AVG function. (Note that NULL values are ignored by this function.)

    1. SELECT SalesPersonID, AVG(Value) AS MeanValue
    2. FROM Sales.OrdersBySalesperson AS OBSP
    3. WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282)
    4. GROUP BY SalesPersonID
    5. ORDER BY SalesPersonID;

    Results:



    Median

    The median is calculated by arranging all values in the data set in order, then determining the middle number. If there are an even number of values, you'll add the two in the middle and calculate the mean. In SQL Server, this isn't as easy to achieve. However, with the addition of common table expressions (CTEs) and ranking functions, it has become easier.

    First, we create a CTE that will order the sales value. The ROW_NUMBER function ranks the orders by value, looking at each salesperson separately. The COUNT function will tell us how many orders the salesperson has.

    1. WITH OrdersBySP (SPID, Value, RowNum, CountOrders) AS  
    2. (
    3.     SELECT SalesPersonID,
    4.         Value,
    5.         ROW_NUMBER() OVER (PARTITION BY SalesPersonID ORDER BY Value),
    6.         COUNT(SalesOrderID) OVER (PARTITION BY SalesPersonID)
    7.     FROM Sales.OrdersBySalesperson AS OBSP
    8.     WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282)
    9. )
    10.  
    11. SELECT SPID, Value, RowNum, CountOrders
    12. FROM OrdersBySP;

    Here's a sample of the results. As you can see, salesperson 275 has a total of 86 orders. Salesperson 277 has 97.



    We're going to add a WHERE clause to the CTE:

    1. WITH OrdersBySP (SPID, Value, RowNum, CountOrders) AS  
    2. (
    3.     SELECT SalesPersonID,
    4.         Value,
    5.         ROW_NUMBER() OVER (PARTITION BY SalesPersonID ORDER BY Value),
    6.         COUNT(SalesOrderID) OVER (PARTITION BY SalesPersonID)
    7.     FROM Sales.OrdersBySalesperson AS OBSP
    8.     WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282)
    9. )
    10.  
    11. SELECT SPID, Value, RowNum, CountOrders
    12. FROM OrdersBySP
    13. WHERE RowNum BETWEEN (CountOrders + 1)/2 AND (CountOrders + 2)/2;



    What does this mean? (Please laugh at that.) Remember, to calculate median, we have to determine the middle row. This will vary depending on whether we have an even or odd number of rows. So, for a salesperson with an even number of orders, it is getting the two in the middle. For a salesperson with an odd number of orders, it is getting the one in the middle.

    The next query will find the average of the two middle values, if necessary. If the two middle values are not the same, it will return a value that is not in the data set (and that’s OK!).

    1. WITH OrdersBySP (SPID, Value, RowNum, CountOrders) AS  
    2. (
    3.     SELECT SalesPersonID,
    4.         Value,
    5.         ROW_NUMBER() OVER (PARTITION BY SalesPersonID ORDER BY Value),
    6.         COUNT(SalesOrderID) OVER (PARTITION BY SalesPersonID)
    7.     FROM Sales.OrdersBySalesperson AS OBSP
    8.     WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282)
    9. )
    10.  
    11. SELECT SPID, AVG(Value) as AvgValue
    12. FROM OrdersBySP
    13. WHERE RowNum BETWEEN (CountOrders + 1)/2 AND (CountOrders + 2)/2
    14. GROUP BY SPID;



    You can see that this yields a different result set than using the AVG function does.

    Median in SQL Server 2012

    In the newest version of SQL Server, it gets even easier to calculate a median value. There is a new function, PERCENTILE_CONT. It "calculates a percentile based on a continuous distribution of the column value" (to quote Books Online). Let's see how this works.

    You will specify the percentile you want - in this case, we want the 50th, so we use (0.5). Then, we use WITHIN GROUP (ORDER BY ... ) to tell the function which values to sort and compute. I'm using OVER (PARTITION BY ... ) to tell the function that I want to divide the values up by salesperson.

    1. SELECT DISTINCT OBSP.SalesPersonID,
    2.     PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY OBSP.Value)
    3.         OVER (PARTITION BY OBSP.SalesPersonID) AS MedianCont
    4. FROM Sales.OrdersBySalesperson AS OBSP
    5. WHERE OBSP.SalesPersonID IN (274, 275, 277, 278, 279, 282)
    6. ORDER BY OBSP.SalesPersonID;

    It will look for a value in the percentile you request. The value will not necessarily exist in the set. If you want a function that looks at the values in the set and returns one in it, use PERCENTILE_DISC.

    This is a fantastic improvement!

    Mode

    The mode is the most frequently occurring value in a data set. This is more useful with a data set that has a lot of values that are the same.

    To translate this into T-SQL, let's look at the quantities of an item sold. We'll get the COUNT of the quantity, and put it in descending order to see what quantity of the item is ordered most frequently.

    1. SELECT COUNT(SOD.OrderQty) AS FrequencyOfValue, SOD.OrderQty
    2. FROM Sales.SalesOrderHeader AS SOH
    3.     INNER JOIN Sales.SalesOrderDetail AS SOD ON SOD.SalesOrderID = SOH.SalesOrderID
    4. WHERE SOH.OrderDate >= '20080101'
    5.     AND SOH.OrderDate < '20090101'
    6.     AND NOT(SOH.SalesPersonID IS NULL)
    7.         AND SOD.ProductID IN (864)
    8. GROUP BY SOD.OrderQty
    9. ORDER BY COUNT(SOD.OrderQty) DESC;

    Now, we will use the TOP clause to get the most-frequently-ordered quantity. Make sure you use WITH TIES so that if if two or more values occur with the same frequency, both are listed.

    1. SELECT TOP 1 WITH TIES SOD.OrderQty
    2. FROM Sales.SalesOrderHeader AS SOH
    3.     INNER JOIN Sales.SalesOrderDetail AS SOD ON SOD.SalesOrderID = SOH.SalesOrderID
    4. WHERE SOH.OrderDate >= '20080101'
    5.     AND SOH.OrderDate < '20090101'  
    6.     AND NOT(SOH.SalesPersonID IS NULL)
    7.         AND SOD.ProductID IN (864)
    8. GROUP BY SOD.OrderQty
    9. ORDER BY COUNT(SOD.OrderQty) DESC;



    It's Hard to Be Average

    As you've seen, "average" can mean more than one thing. Make sure that you understand the differences between the three options, and which one your users need to see in a given case.

    Resources

    LTD's own George Mastros wrote about this in 2008: http://blogs.lessthandot.com/index.php/DataMgmt/DataDesign/calculating-mean-median-and-mode-with-sq.

    Itzik Ben-Gan tackled median here: http://www.sqlmag.com/article/tsql3/calculating-the-median-gets-simpler-in-sql-server-2005.

    About the Author

    User bio imageJes Schultz Borland is a Consultant with Brent Ozar PLF, and a Microsoft SQL Server MVP. She holds an AAS - Programmer/Analyst degree, and has worked with SQL Server since 2007, focusing on Reporting Services and day-to-day administration. She is an active member of PASS, Director of Communication for MADPASS, founder of Tech on Tap, and a LessThanDot.com blogger. She's a frequent presenter at user groups, SQL Saturdays, and other community events. She is also an avid runner and chef.
    Social SitingsTwitterLinkedInLTD RSS Feed
    Instapaper

    2 comments

    Comment from: David Forck (thirster42) [Member]
    dont' forget standard deviation! ;)
    02/08/12 @ 12:09
    Comment from: Chuck Rummel [Visitor] · http://twitter.com/crummel4
    Chuck Rummel I haven't played with 2012 yet so how is percentile_cont(0.5) different than doing a max of the value after using ntile(2) ? Looks like I've got some learnin' to do...
    02/09/12 @ 18:43

    Leave a comment


    Your email address will not be revealed on this site.

    Your URL will be displayed.
    (Line breaks become <br />)
    (Name, email & website)
    (Allow users to contact you through a message form (your email will not be revealed.)