Calculating Median and Mode with SQL Server can be frustrating for some developers, but it doesn’t have to be. Often times, inexperienced developers will attempt to write this with procedural programming practices, but set based methods do exist.
Before showing you methods to calculate these values, it’s probably best to explain what they are.
Mean is another name for average. SQL Server has a built-in function to calculate this value.
Data ---- 1 2 5 5 5 6 6 6 7 9 10
To calculate the average, sum the data and divide by the number of rows. In this case, 1 + 2 + 5 + 5 + 5 + 6 + 6 + 6 + 7 + 9 + 10 = 62. 62/11 = 5.636363
Median represents the ‘middle’ value. To calculate the median by hand, you sort the values and return the value that appears in the middle of the list. If there is an odd number of items, there will be exactly one value in the middle of the list. If there is an even number of items, you average the 2 values that are closest to the middle.
Data 1 2 5 5 5 <span style="color:red;">6</span> 6 6 7 9 10
Since there is an odd number of rows, the row appearing in the middle of the list contains your median value.
Data 1 2 5 5 <span style="color:red;">5 6</span> 6 6 7 9
Now, there is an even number of rows. The median for this data set is (5 + 6)/2 = 5.5. Simply take the average of the 2 values appearing in the middle of the data set.
MODE
The mode for a data set is the item(s) that appear most frequently. To calculate this by hand, you write a distinct list of values and count the number of times a value appears. The value the appears the most is your mode.
Data Frequency ---- --------- 1 1 2 1 <span style="color:red;">5 3 6 3</span> 7 1 9 1 10 1
This data set is considered to be Bi-Modal because there are 2 values with the same frequency. With this data set, the modes are 5 and 6.
For demonstration purposes, I will create a table variable and populate it with data. All code will be based on this table variable. The data type for the data column will be Decimal(10,5). If we used an integer column, then we would need to concern ourselves with integer math issues, which is not the focus of this blog.
AVERAGE
As I stated earlier, SQL Server has a built-in function for calculating the average. The Avg function will ignore rows with NULL. So the average of 1, 2, NULL is 1.5 because the sum of the data is 3 and there are 2 rows that are not NULL. 3/2 = 1.5.
Declare @Temp Table(Id Int Identity(1,1), Data Decimal(10,5)) Insert into @Temp Values(1) Insert into @Temp Values(2) Insert into @Temp Values(5) Insert into @Temp Values(5) Insert into @Temp Values(5) Insert into @Temp Values(6) Insert into @Temp Values(6) Insert into @Temp Values(6) Insert into @Temp Values(7) Insert into @Temp Values(9) Insert into @Temp Values(10) Insert into @Temp Values(NULL) Select Avg(Data) From @Temp -- 5.636363
MEDIAN
To calculate the median, we will select the last value in the top 50 percent of rows, and the first value in the bottom 50 percent (all while ignoring NULL values).
To get the last value in the top 50 percent of rows….
Select Top 1 Data From ( Select Top 50 Percent Data From @Temp Where Data Is NOT NULL Order By Data ) As A Order By Data DESC
To get the first value in the last 50 percent of rows…
Select Top 1 Data From ( Select Top 50 Percent Data From @Temp Where Data Is NOT NULL Order By Data DESC ) As A Order By Data Asc
Putting it all together:
Declare @Temp Table(Id Int Identity(1,1), Data Decimal(10,5)) Insert into @Temp Values(1) Insert into @Temp Values(2) Insert into @Temp Values(5) Insert into @Temp Values(5) Insert into @Temp Values(5) Insert into @Temp Values(6) Insert into @Temp Values(6) Insert into @Temp Values(6) Insert into @Temp Values(7) Insert into @Temp Values(9) Insert into @Temp Values(10) Insert into @Temp Values(NULL) Select (( Select Top 1 Data From ( Select Top 50 Percent Data From @Temp Where Data Is NOT NULL Order By Data ) As A Order By Data DESC) + ( Select Top 1 Data From ( Select Top 50 Percent Data From @Temp Where Data Is NOT NULL Order By Data DESC ) As A Order By Data Asc)) / 2 -- 6
MODE
To Calculate the mode with sql server, we first need to get the counts for each value in the set. Then, we need to filter the data so that values equal to the count are returned.
Declare @Temp Table(Id Int Identity(1,1), Data Decimal(10,5)) Insert into @Temp Values(1) Insert into @Temp Values(2) Insert into @Temp Values(5) Insert into @Temp Values(5) Insert into @Temp Values(5) Insert into @Temp Values(6) Insert into @Temp Values(6) Insert into @Temp Values(6) Insert into @Temp Values(7) Insert into @Temp Values(9) Insert into @Temp Values(10) Insert into @Temp Values(NULL) SELECT TOP 1 with ties DATA FROM @Temp WHERE DATA IS Not NULL GROUP BY DATA ORDER BY COUNT(*) DESC
As you can see, there are set based methods for calculating all of these values, which can be many times faster than calculating these values in a cursor.
27 Comments
Thank you! I run into this all the time and I never remember to take advantage of the top X percent phrase.
Very helpful, thanks for posting this!
this is helpful but I made changes to Mode calculation as following..is it any better?
SELECT TOP 1 DATA
FROM @Temp
WHERE DATA IS Not NULL
GROUP BY DATA
ORDER BY COUNT(*) DESC
@manish,
Strictly speaking, your query is not correct because data sets can be multi-modal, which your query does not accommodate.
For example, if you have the following values: 1,2,5,5,5,6,6,6,7,9,10
5 appears 3 times and 6 appears 3 times. This data set is considered multi-modal because there are multiple values that appear in the data set the same number of times.
That being said, a slight modification to your query will return the correct results and execute faster than the query I show.
SELECT TOP 1 WITH TIES DATA
FROM @Temp
WHERE DATA IS Not NULL
GROUP BY DATA
ORDER BY COUNT(*) DESC
Notice the “WITH TIES” addition to your query. This allows multiple rows to be returned where the top 1 value appears in multiple rows.
Thank you for your comment. I will change the query in the blog so that others may benefit from this (without needing to read these comments).
very useful, thx 🙂
Very helpful. One suggestion on the mode, however. You probably want a HAVING COUNT(*)>1 in there to account for the fact that some datasets may not have a mode.
Ben,
Could you give an example of a data set that doesn’t have a mode? Consider the data set:
1
2
Isn’t the mode [1, 2]?
Thank you for the great tip, helped a lot!
How would you get the mean, mode, median By a Group of some sort?
DECLARE @Temp TABLE(Id INT IDENTITY(1,1),Application_id int, DATA DECIMAL(10,5))
INSERT INTO @Temp VALUES(1,1)
INSERT INTO @Temp VALUES(1,2)
INSERT INTO @Temp VALUES(1,5)
INSERT INTO @Temp VALUES(1,5)
INSERT INTO @Temp VALUES(2,5)
INSERT INTO @Temp VALUES(1,6)
INSERT INTO @Temp VALUES(2,6)
INSERT INTO @Temp VALUES(2,6)
INSERT INTO @Temp VALUES(1,7)
INSERT INTO @Temp VALUES(2,9)
INSERT INTO @Temp VALUES(2,10)
INSERT INTO @Temp VALUES(null,NULL)
Want to get the mean, mode, median by the Application_Id in the above table.
are you sure mean equals average in statistics?
@greatbear302,
Yes. I am sure that “Arithmetic Mean” is the same thing as Avg. There are other types of means that statisticians use, like Geometric Mean, Harmonic Mean, etc…
The calculations that SQL Server perform with the AVG aggregate function is what statisticians refer to as Arithmetic Mean.
For more information regarding other types of Means:
http://en.wikipedia.org/wiki/Mean
Great Article and Helped me a lot to clearly understand the logic.
Hi,i cant have the results i want, here below my sample. could you help please? I have
select 75012345 bc, cast(2.5 as decimal(12,2)) vl into _tbl
insert into _tbl select 75012345, 5.0
insert into _tbl select 75012345, 4.0
insert into _tbl select 75054321, 3.5
insert into _tbl select 75054321, 2.0
insert into _tbl select 75054321, 3.0
select top 1 with ties vl
from _tbl
where vl is not null
group by vl
order by count(*) desc
select bc, avg(vl) from _tbl group by bc
ivonna,
The code you are using is for MODE. Your sample data set is considered multi-modal because each value appears exactly once.
i have the following sample data and script cobbled together through forums and head scratching (fairly new to SQL)
I can get the median for the total of the figures but need to group it by month.
Any idea anyone??
CREATE TABLE #data (number INT, Month_Name nvarchar (10))
INSERT INTO #data
SELECT 15 as number, ‘jan’ as Month_Name union all
SELECT 26 as number, ‘jan’ as Month_Name union all
SELECT 47 as number, ‘jan’ as Month_Name union all
SELECT 25 as number, ‘jan’ as Month_Name union all
SELECT 15 as number, ‘jan’ as Month_Name union all
SELECT 20 as number, ‘jan’ as Month_Name union all
SELECT 22 as number, ‘jan’ as Month_Name union all
SELECT 40 as number, ‘jan’ as Month_Name union all
SELECT 98 as number, ‘mar’ as Month_Name union all
SELECT 15 as number, ‘mar’ as Month_Name union all
SELECT 48 as number, ‘mar’ as Month_Name union all
SELECT 75 as number, ‘mar’ as Month_Name union all
SELECT 25 as number, ‘mar’ as Month_Name union all
SELECT 40 as number, ‘mar’ as Month_Name union all
SELECT 44 as number, ‘mar’ as Month_Name union all
SELECT 40 as number, ‘mar’ as Month_Name union all
SELECT 5 as number, ‘feb’ as Month_Name union all
SELECT 2 as number, ‘feb’ as Month_Name union all
SELECT 3 as number, ‘feb’ as Month_Name union all
SELECT 4 as number, ‘feb’ as Month_Name union all
SELECT 5 as number, ‘feb’ as Month_Name union all
SELECT 2 as number, ‘feb’ as Month_Name union all
SELECT 3 as number, ‘feb’ as Month_Name union all
SELECT 4 as number, ‘feb’ as Month_Name union all
SELECT 7 as number, ‘feb’ as Month_Name
select * from #data
–8888888888888888888888888888888888
SELECT AVG(1.0E * number) ,Month_Name as mine
FROM (
SELECT number,Month_Name,
2 * ROW_NUMBER() OVER (ORDER BY number) – COUNT(*) OVER () AS y
FROM #data
)AS d
WHERE y BETWEEN 0 AND 2
group by Month_Name
–8888888888888888888888888888888888
drop table #data
Paul, try this:
Select MonthCount.Month_Name, Avg(Number)
From (
Select Month_Name, Count(*) As MonthCount From #Data Group By Month_Name
) As MonthCount
Inner Join (
select Number, Month_Name, Row_Number() Over (Partition By Month_Name Order By NUmber) As RowNumber
from #data
) As RowCounts
On MonthCount.Month_Name = RowCounts.Month_Name
Where 2 * RowNumber – MonthCount Between -1 and 1
Group By MonthCount.Month_Name
Thanks! One of the business managers thinks I’m a god now!
You never gave the sql for the MEAN.
Marge,
MEAN is exactly the same as Average, which I did show the code for. I labelled is AVERAGE instead of MEAN. Sorry for the confusion.
Interesting discussion with good links in this thread in MSDN forum
Hey guys, thanks for creating this blog. My question is that I’m a bit confused on the median coding. Can you please explain in detail, perhaps with an example?
Is there a way to do this if your version of SQL does not provide the TOP 50 Percent phrase?
Regarding the query for the median, it might be better to make the denominator 2.0 instead of just 2 so that it will work even for integer data types.
— Here’s an alternative solution for MEDIAN using
— ROWNUMBER and AVG
DECLARE @TotalRecords int
DECLARE @Temp Table(Id int IDENTITY(1,1), Data int) –Decimal(10,5))
INSERT INTO @Temp VALUES(1)
INSERT INTO @Temp VALUES(2)
INSERT INTO @Temp VALUES(3)
INSERT INTO @Temp VALUES(4)
–INSERT INTO @Temp VALUES(5)
SELECT @TotalRecords = COUNT(*)
FROM @Temp
;
WITH TEMP_WITH_ROW_NUMBER
AS
(
SELECT ROW_NUMBER() OVER(ORDER BY Data ASC) as [_RowIndex]
,Data
FROM @Temp
WHERE Data IS NOT NULL
)
SELECT AVG(Data * 1.0) as ‘Median’
FROM TEMP_WITH_ROW_NUMBER
WHERE _RowIndex IN (
FLOOR((@TotalRecords + 1) / 2.0) — MidLow
,CEILING((@TotalRecords + 1) / 2.0) — MidHigh
)
Hi Guys,
I need to get Modes for two columns at a time, see this:
Product Catalogue Page
1 1 3
1 1 2
1 2 3
Considering this table structure, my expected result is :
Product Catalogue Page
1 1 3
Where Catalogue=1 and Page=3 are Modes.
Pelase help
Ever find a solution? I have the same issue.
While the method of calculating mode is fantastic, and thank you for posting it, how would you go about calculating the mode of the values of several columns…like 64 more? Is it possible? Thanks again.