Average Joe. An average day. An average salary. An average run. An average house.

You hear about “averages” a lot. But what does it really mean? And why is this important to a SQL Server professional?

**Being Average**

An average is the “middle” value in a set of data values. When talking about or calculating an average, most people consider using the “mean”. However, there are also “median” and “mode” to take into consideration. Each is calculated differently, and can produce drastically different results across a data set. As a data professional, particularly when working with business intelligence and reporting, knowing what data point the business or user needs to know is of utmost importance.

Let’s use the AdventureWorks2008R2 sample database to explore these functions. We’ll use the SalesOrderHeader and SalesOrderDetail tables to determine “averages”.

To begin, create a view that takes the orders, by salesperson, for a year, and sums the number of orders and value of orders.

T-SQL | |

1 2 3 4 5 6 7 8 9 | CREATE VIEW Sales.OrdersBySalesperson AS SELECT SOH.SalesOrderID, SOH.OrderDate, SOH.SalesPersonID, SOH.CustomerID, SUM(SOD.OrderQty) AS Qty, SUM(SOD.LineTotal) AS Value FROM Sales.SalesOrderHeader AS SOH INNER JOIN Sales.SalesOrderDetail AS SOD ON SOD.SalesOrderID = SOH.SalesOrderID WHERE SOH.OrderDate >= '20080101' AND SOH.OrderDate < '20090101' AND NOT(SalesPersonID IS NULL) GROUP BY SOH.SalesOrderID, SOH.OrderDate, SOH.SalesPersonID, SOH.CustomerID; |

CREATE VIEW Sales.OrdersBySalesperson AS SELECT SOH.SalesOrderID, SOH.OrderDate, SOH.SalesPersonID, SOH.CustomerID, SUM(SOD.OrderQty) AS Qty, SUM(SOD.LineTotal) AS Value FROM Sales.SalesOrderHeader AS SOH INNER JOIN Sales.SalesOrderDetail AS SOD ON SOD.SalesOrderID = SOH.SalesOrderID WHERE SOH.OrderDate >= '20080101' AND SOH.OrderDate < '20090101' AND NOT(SalesPersonID IS NULL) GROUP BY SOH.SalesOrderID, SOH.OrderDate, SOH.SalesPersonID, SOH.CustomerID;

**Mean**

The mean is calculated by adding all the values in a data set, then dividing by the number of values in the set.

In SQL Server, this can easily be achieved by using the AVG function. (Note that NULL values are ignored by this function.)

T-SQL | |

1 2 3 4 5 | SELECT SalesPersonID, AVG(Value) AS MeanValue FROM Sales.OrdersBySalesperson AS OBSP WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282) GROUP BY SalesPersonID ORDER BY SalesPersonID; |

SELECT SalesPersonID, AVG(Value) AS MeanValue FROM Sales.OrdersBySalesperson AS OBSP WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282) GROUP BY SalesPersonID ORDER BY SalesPersonID;

Results:

**Median**

The median is calculated by arranging all values in the data set in order, then determining the middle number. If there are an even number of values, you’ll add the two in the middle and calculate the mean. In SQL Server, this isn’t as easy to achieve. However, with the addition of common table expressions (CTEs) and ranking functions, it has become easier.

First, we create a CTE that will order the sales value. The ROW_NUMBER function ranks the orders by value, looking at each salesperson separately. The COUNT function will tell us how many orders the salesperson has.

T-SQL | |

1 2 3 4 5 6 7 8 9 10 11 12 | WITH OrdersBySP (SPID, Value, RowNum, CountOrders) AS ( SELECT SalesPersonID, Value, ROW_NUMBER() OVER (PARTITION BY SalesPersonID ORDER BY Value), COUNT(SalesOrderID) OVER (PARTITION BY SalesPersonID) FROM Sales.OrdersBySalesperson AS OBSP WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282) ) SELECT SPID, Value, RowNum, CountOrders FROM OrdersBySP; |

WITH OrdersBySP (SPID, Value, RowNum, CountOrders) AS ( SELECT SalesPersonID, Value, ROW_NUMBER() OVER (PARTITION BY SalesPersonID ORDER BY Value), COUNT(SalesOrderID) OVER (PARTITION BY SalesPersonID) FROM Sales.OrdersBySalesperson AS OBSP WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282) ) SELECT SPID, Value, RowNum, CountOrders FROM OrdersBySP;

Here’s a sample of the results. As you can see, salesperson 275 has a total of 86 orders. Salesperson 277 has 97.

We’re going to add a WHERE clause to the CTE:

T-SQL | |

1 2 3 4 5 6 7 8 9 10 11 12 13 | WITH OrdersBySP (SPID, Value, RowNum, CountOrders) AS ( SELECT SalesPersonID, Value, ROW_NUMBER() OVER (PARTITION BY SalesPersonID ORDER BY Value), COUNT(SalesOrderID) OVER (PARTITION BY SalesPersonID) FROM Sales.OrdersBySalesperson AS OBSP WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282) ) SELECT SPID, Value, RowNum, CountOrders FROM OrdersBySP WHERE RowNum BETWEEN (CountOrders + 1)/2 AND (CountOrders + 2)/2; |

WITH OrdersBySP (SPID, Value, RowNum, CountOrders) AS ( SELECT SalesPersonID, Value, ROW_NUMBER() OVER (PARTITION BY SalesPersonID ORDER BY Value), COUNT(SalesOrderID) OVER (PARTITION BY SalesPersonID) FROM Sales.OrdersBySalesperson AS OBSP WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282) ) SELECT SPID, Value, RowNum, CountOrders FROM OrdersBySP WHERE RowNum BETWEEN (CountOrders + 1)/2 AND (CountOrders + 2)/2;

What does this *mean*? (Please laugh at that.) Remember, to calculate median, we have to determine the middle row. This will vary depending on whether we have an even or odd number of rows. So, for a salesperson with an even number of orders, it is getting the two in the middle. For a salesperson with an odd number of orders, it is getting the one in the middle.

The next query will find the average of the two middle values, if necessary. If the two middle values are not the same, it will return a value that is not in the data set (and that’s OK!).

T-SQL | |

1 2 3 4 5 6 7 8 9 10 11 12 13 14 | WITH OrdersBySP (SPID, Value, RowNum, CountOrders) AS ( SELECT SalesPersonID, Value, ROW_NUMBER() OVER (PARTITION BY SalesPersonID ORDER BY Value), COUNT(SalesOrderID) OVER (PARTITION BY SalesPersonID) FROM Sales.OrdersBySalesperson AS OBSP WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282) ) SELECT SPID, AVG(Value) as AvgValue FROM OrdersBySP WHERE RowNum BETWEEN (CountOrders + 1)/2 AND (CountOrders + 2)/2 GROUP BY SPID; |

WITH OrdersBySP (SPID, Value, RowNum, CountOrders) AS ( SELECT SalesPersonID, Value, ROW_NUMBER() OVER (PARTITION BY SalesPersonID ORDER BY Value), COUNT(SalesOrderID) OVER (PARTITION BY SalesPersonID) FROM Sales.OrdersBySalesperson AS OBSP WHERE SalesPersonID IN (274, 275, 277, 278, 279, 282) ) SELECT SPID, AVG(Value) as AvgValue FROM OrdersBySP WHERE RowNum BETWEEN (CountOrders + 1)/2 AND (CountOrders + 2)/2 GROUP BY SPID;

You can see that this yields a different result set than using the AVG function does.

**Median in SQL Server 2012**

In the newest version of SQL Server, it gets even easier to calculate a median value. There is a new function, PERCENTILE_CONT. It “calculates a percentile based on a continuous distribution of the column value” (to quote Books Online). Let’s see how this works.

You will specify the percentile you want – in this case, we want the 50th, so we use (0.5). Then, we use WITHIN GROUP (ORDER BY … ) to tell the function which values to sort and compute. I’m using OVER (PARTITION BY … ) to tell the function that I want to divide the values up by salesperson.

T-SQL | |

1 2 3 4 5 6 | SELECT DISTINCT OBSP.SalesPersonID, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY OBSP.Value) OVER (PARTITION BY OBSP.SalesPersonID) AS MedianCont FROM Sales.OrdersBySalesperson AS OBSP WHERE OBSP.SalesPersonID IN (274, 275, 277, 278, 279, 282) ORDER BY OBSP.SalesPersonID; |

SELECT DISTINCT OBSP.SalesPersonID, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY OBSP.Value) OVER (PARTITION BY OBSP.SalesPersonID) AS MedianCont FROM Sales.OrdersBySalesperson AS OBSP WHERE OBSP.SalesPersonID IN (274, 275, 277, 278, 279, 282) ORDER BY OBSP.SalesPersonID;

It will look for a value in the percentile you request. The value will not necessarily exist in the set. If you want a function that looks at the values in the set and returns one in it, use PERCENTILE_DISC.

This is a fantastic improvement!

**Mode**

The mode is the most frequently occurring value in a data set. This is more useful with a data set that has a lot of values that are the same.

To translate this into T-SQL, let’s look at the quantities of an item sold. We’ll get the COUNT of the quantity, and put it in descending order to see what quantity of the item is ordered most frequently.

T-SQL | |

1 2 3 4 5 6 7 8 9 | SELECT COUNT(SOD.OrderQty) AS FrequencyOfValue, SOD.OrderQty FROM Sales.SalesOrderHeader AS SOH INNER JOIN Sales.SalesOrderDetail AS SOD ON SOD.SalesOrderID = SOH.SalesOrderID WHERE SOH.OrderDate >= '20080101' AND SOH.OrderDate < '20090101' AND NOT(SOH.SalesPersonID IS NULL) AND SOD.ProductID IN (864) GROUP BY SOD.OrderQty ORDER BY COUNT(SOD.OrderQty) DESC; |

SELECT COUNT(SOD.OrderQty) AS FrequencyOfValue, SOD.OrderQty FROM Sales.SalesOrderHeader AS SOH INNER JOIN Sales.SalesOrderDetail AS SOD ON SOD.SalesOrderID = SOH.SalesOrderID WHERE SOH.OrderDate >= '20080101' AND SOH.OrderDate < '20090101' AND NOT(SOH.SalesPersonID IS NULL) AND SOD.ProductID IN (864) GROUP BY SOD.OrderQty ORDER BY COUNT(SOD.OrderQty) DESC;

Now, we will use the TOP clause to get the most-frequently-ordered quantity. Make sure you use WITH TIES so that if if two or more values occur with the same frequency, both are listed.

T-SQL | |

1 2 3 4 5 6 7 8 9 | SELECT TOP 1 WITH TIES SOD.OrderQty FROM Sales.SalesOrderHeader AS SOH INNER JOIN Sales.SalesOrderDetail AS SOD ON SOD.SalesOrderID = SOH.SalesOrderID WHERE SOH.OrderDate >= '20080101' AND SOH.OrderDate < '20090101' AND NOT(SOH.SalesPersonID IS NULL) AND SOD.ProductID IN (864) GROUP BY SOD.OrderQty ORDER BY COUNT(SOD.OrderQty) DESC; |

SELECT TOP 1 WITH TIES SOD.OrderQty FROM Sales.SalesOrderHeader AS SOH INNER JOIN Sales.SalesOrderDetail AS SOD ON SOD.SalesOrderID = SOH.SalesOrderID WHERE SOH.OrderDate >= '20080101' AND SOH.OrderDate < '20090101' AND NOT(SOH.SalesPersonID IS NULL) AND SOD.ProductID IN (864) GROUP BY SOD.OrderQty ORDER BY COUNT(SOD.OrderQty) DESC;

**It’s Hard to Be Average **

As you’ve seen, “average” can mean more than one thing. Make sure that you understand the differences between the three options, and which one your users need to see in a given case.

*Resources*

LTD’s own George Mastros wrote about this in 2008: http://blogs.lessthandot.com/index.php/DataMgmt/DataDesign/calculating-mean-median-and-mode-with-sq.

Itzik Ben-Gan tackled median here: http://www.sqlmag.com/article/tsql3/calculating-the-median-gets-simpler-in-sql-server-2005.

## 3 Comments

dont’ forget standard deviation! 😉

I haven’t played with 2012 yet so how is percentile_cont(0.5) different than doing a max of the value after using ntile(2) ? Looks like I’ve got some learnin’ to do…

I am trying to find the median. I read your blog. I understand the concept of choosing the middle rows..But here’s where I loose – RowNum BETWEEN (CountOrders + 1)/2 AND (CountOrders + 2)/2

For 277, the count orders is 97 which is an odd number and the median will be the 98/2=49th row – RowNum BETWEEN (49 AND 49.5)

For 275, the count orders is 86 which is an even number and the median will be the 87/2=43 and 44th rows – RowNum BETWEEN (43.5 AND 44)

Could you please let me know the logic here, not from a statistical point of view but based on the query point of view ? how come for salesID 277 the RowNum chosen is only the 49th row(RowNum BETWEEN (49 AND 49.5)), but for salesID 275, the Rownum choosen are both 43 and 44 rows(RowNum BETWEEN (43.5 AND 44)).

Thanks