Login or Sign Up to become a member!

EXPERTS, INFORMATION, IDEAS & KNOWLEDGE

Social bookmarker Add this

Your profile

Search

July 2009
Mon Tue Wed Thu Fri Sat Sun
 << <   > >>
    1 2 3 4 5
6 7 8 9 10 11 12
13 14 15 16 17 18 19
20 21 22 23 24 25 26
27 28 29 30 31    

XML Feeds

Authors

« sp_replwritetovarbin Heap Overflow Code Exploit Code In The Wild, Works By Using Our Good Friend SQL InjectionSQL Friday, The Best SQL Server Links Of The Past Week Episode 3 »
The Data Management Journal

Calculating Mean, Median and Mode with SQL Server

by George Mastros


Permalink 22 Dec 2008 09:57 , Categories: Data Modelling & Design, Microsoft SQL Server

Calculating Median and Mode with SQL Server can be frustrating for some developers, but it doesn't have to be. Often times, inexperienced developers will attempt to write this with procedural programming practices, but set based methods do exist.

Before showing you methods to calculate these values, it’s probably best to explain what they are.
Mean is another name for average. SQL Server has a built-in function to calculate this value.

Data
----
1
2
5
5
5
6
6
6
7
9
10

To calculate the average, sum the data and divide by the number of rows. In this case, 1 + 2 + 5 + 5 + 5 + 6 + 6 + 6 + 7 + 9 + 10 = 62. 62/11 = 5.636363

Median represents the 'middle' value. To calculate the median by hand, you sort the values and return the value that appears in the middle of the list. If there is an add number of items, there will be exactly one value in the middle of the list. If there is an even number of items, you average the 2 values that are closest to the middle.

Data
1
2
5
5
5
6
6
6
7
9
10

Since there is an odd number of rows, the row appearing in the middle of the list contains your median value.

Data
1
2
5
5
5
6
6
6
7
9

Now, there is an even number of rows. The median for this data set is (5 + 6)/2 = 5.5. Simply take the average of the 2 values appearing in the middle of the data set.

MODE
The mode for a data set is the item(s) that appear most frequently. To calculate this by hand, you write a distinct list of values and count the number of times a value appears. The value the appears the most is your mode.

Data	Frequency
----    ---------
1	1
2	1
5	3
6	3
7	1
9	1
10	1

This data set is considered to be Bi-Modal because there are 2 values with the same frequency. With this data set, the modes are 5 and 6.

For demonstration purposes, I will create a table variable and populate it with data. All code will be based on this table variable. The data type for the data column will be Decimal(10,5). If we used an integer column, then we would need to concern ourselves with integer math issues, which is not the focus of this blog.



AVERAGE
As I stated earlier, SQL Server has a built-in function for calculating the average. The Avg function will ignore rows with NULL. So the average of 1, 2, NULL is 1.5 because the sum of the data is 3 and there are 2 rows that are not NULL. 3/2 = 1.5.

  1. DECLARE @Temp TABLE(Id INT IDENTITY(1,1), DATA DECIMAL(10,5))
  2.  
  3. INSERT INTO @Temp VALUES(1)
  4. INSERT INTO @Temp VALUES(2)
  5. INSERT INTO @Temp VALUES(5)
  6. INSERT INTO @Temp VALUES(5)
  7. INSERT INTO @Temp VALUES(5)
  8. INSERT INTO @Temp VALUES(6)
  9. INSERT INTO @Temp VALUES(6)
  10. INSERT INTO @Temp VALUES(6)
  11. INSERT INTO @Temp VALUES(7)
  12. INSERT INTO @Temp VALUES(9)
  13. INSERT INTO @Temp VALUES(10)
  14. INSERT INTO @Temp VALUES(NULL)
  15.  
  16. SELECT AVG(DATA)
  17. FROM   @Temp
  18.  
  19. -- 5.636363




MEDIAN
To calculate the median, we will select the last value in the top 50 percent of rows, and the first value in the bottom 50 percent (all while ignoring NULL values).
To get the last value in the top 50 percent of rows….

		Select Top 1 Data
		From   (
				Select Top 50 Percent Data
				From	@Temp
				Where	Data Is NOT NULL
				Order By Data
				) As A
		Order By Data DESC

To get the first value in the last 50 percent of rows…

		Select Top 1 Data
		From   (
				Select Top 50 Percent Data
				From	@Temp
				Where	Data Is NOT NULL
				Order By Data DESC
				) As A
		Order By Data Asc

Putting it all together:

  1. DECLARE @Temp TABLE(Id INT IDENTITY(1,1), DATA DECIMAL(10,5))
  2.  
  3. INSERT INTO @Temp VALUES(1)
  4. INSERT INTO @Temp VALUES(2)
  5. INSERT INTO @Temp VALUES(5)
  6. INSERT INTO @Temp VALUES(5)
  7. INSERT INTO @Temp VALUES(5)
  8. INSERT INTO @Temp VALUES(6)
  9. INSERT INTO @Temp VALUES(6)
  10. INSERT INTO @Temp VALUES(6)
  11. INSERT INTO @Temp VALUES(7)
  12. INSERT INTO @Temp VALUES(9)
  13. INSERT INTO @Temp VALUES(10)
  14. INSERT INTO @Temp VALUES(NULL)
  15.  
  16. SELECT ((
  17.         SELECT TOP 1 DATA
  18.         FROM   (
  19.                 SELECT  TOP 50 PERCENT DATA
  20.                 FROM    @Temp
  21.                 WHERE   DATA IS NOT NULL
  22.                 ORDER BY DATA
  23.                 ) AS A
  24.         ORDER BY DATA DESC) +
  25.         (
  26.         SELECT TOP 1 DATA
  27.         FROM   (
  28.                 SELECT  TOP 50 PERCENT DATA
  29.                 FROM    @Temp
  30.                 WHERE   DATA IS NOT NULL
  31.                 ORDER BY DATA DESC
  32.                 ) AS A
  33.         ORDER BY DATA ASC)) / 2
  34. -- 6




MODE
To Calculate the mode with sql server, we first need to get the counts for each value in the set. Then, we need to filter the data so that values equal to the count are returned.

  1. DECLARE @Temp TABLE(Id INT IDENTITY(1,1), DATA DECIMAL(10,5))
  2.  
  3. INSERT INTO @Temp VALUES(1)
  4. INSERT INTO @Temp VALUES(2)
  5. INSERT INTO @Temp VALUES(5)
  6. INSERT INTO @Temp VALUES(5)
  7. INSERT INTO @Temp VALUES(5)
  8. INSERT INTO @Temp VALUES(6)
  9. INSERT INTO @Temp VALUES(6)
  10. INSERT INTO @Temp VALUES(6)
  11. INSERT INTO @Temp VALUES(7)
  12. INSERT INTO @Temp VALUES(9)
  13. INSERT INTO @Temp VALUES(10)
  14. INSERT INTO @Temp VALUES(NULL)
  15.  
  16. SELECT   DATA
  17. FROM     @Temp
  18. WHERE DATA IS Not NULL
  19. GROUP BY DATA
  20. HAVING   COUNT(*) = (
  21.            SELECT MAX(CountRows)
  22.            FROM    (
  23.                    SELECT DATA, COUNT(*) AS CountRows
  24.                    FROM   @Temp
  25.                        WHERE DATA IS Not NULL
  26.                    GROUP BY DATA
  27.                    ) AS A)

As you can see, there are set based methods for calculating all of these values, which can be many times faster than calculating these values in a cursor.

2 comments »Send a trackback » 4670 views

Trackback address for this post

Trackback URL (right click and copy shortcut/link location)

2 comments

Comment from: dewitte [Member] Email
*****
Thank you! I run into this all the time and I never remember to take advantage of the top X percent phrase.
07/01/09 @ 09:08
Comment from: Lani Fraizer [Visitor] · http://www.synergiesinsync.org
*****
Very helpful, thanks for posting this!
17/03/09 @ 12:24

Leave a comment


Your email address will not be revealed on this site.

Your URL will be displayed.
PoorExcellent
(Line breaks become <br />)
(Name, email & website)
(Allow users to contact you through a message form (your email will not be revealed.)