This month’s T-SQL Tuesday is hosted by Jes Borland and it is all about grouping and aggregate functions.

T-SQL Tuesday logo

With the introduction of window functions in T-SQL language in SQL 2005, getting top N values per group became a simple task of utilizing DENSE_RANK() or RANK() or ROW_NUMBER() function depending on the business requirements.

However, there is still an important aggregate missing from the group. In particular, PERCENT() aggregate function.

There is a Connect item in regards to the missing functions.

You may find an article by Itzik Ben-Gan on this topic Calculate Percentiles very educational.

I become interested in the problem of finding N percent of the group after reading this thread in tek-tips forum SQL Min 1 of Top 10% and Max 1 of Bottom 10%???

The first idea of approaching this problem is in using NTILE() ranking function. It is as close as we can get, but yet it’s not an exactly percentage.

Here is an example that demonstrates discrepancy:

T-SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
USE AdventureWorks2008R2 
GO
;with cte as (SELECT p.FirstName, p.LastName
    ,ROW_NUMBER() OVER (ORDER BY s.SalesYTD) AS 'Row Number'
    ,RANK() OVER (ORDER BY s.SalesYTD) AS 'Rank'
    ,DENSE_RANK() OVER (ORDER BY s.SalesYTD) AS 'Dense Rank'
    ,NTILE(10) OVER (ORDER BY s.SalesYTD) AS 'Quartile',
    
    s.SalesYTD, a.PostalCode, a.City 
FROM Sales.SalesPerson s 
    INNER JOIN Person.Person p 
        ON s.BusinessEntityID = p.BusinessEntityID
    INNER JOIN Person.Address a 
        ON a.AddressID = p.BusinessEntityID
)
    
select cte.*, TP.*, LP.* from cte
LEFT join (select top 10 percent FirstName as TPFName, LastName as TPLName, SalesYTD as TPSalesYTD
  from cte Order by SalesYTD ) TP 
on cte.FirstName = TP.TPFName  
and cte.LastName = TP.TPLName 
LEFT join (select top 10 percent FirstName as LPFName, LastName as LPLName, SalesYTD as LPSalesYTD  
from cte Order by SalesYTD DESC) LP 
on cte.FirstName = LP.LPFName  
and cte.LastName = LP.LPLName 
order by cte.[Row Number] 
USE AdventureWorks2008R2 
GO
;with cte as (SELECT p.FirstName, p.LastName
    ,ROW_NUMBER() OVER (ORDER BY s.SalesYTD) AS 'Row Number'
    ,RANK() OVER (ORDER BY s.SalesYTD) AS 'Rank'
    ,DENSE_RANK() OVER (ORDER BY s.SalesYTD) AS 'Dense Rank'
    ,NTILE(10) OVER (ORDER BY s.SalesYTD) AS 'Quartile',
    
    s.SalesYTD, a.PostalCode, a.City 
FROM Sales.SalesPerson s 
    INNER JOIN Person.Person p 
        ON s.BusinessEntityID = p.BusinessEntityID
    INNER JOIN Person.Address a 
        ON a.AddressID = p.BusinessEntityID
)
    
select cte.*, TP.*, LP.* from cte
LEFT join (select top 10 percent FirstName as TPFName, LastName as TPLName, SalesYTD as TPSalesYTD
  from cte Order by SalesYTD ) TP 
on cte.FirstName = TP.TPFName  
and cte.LastName = TP.TPLName 
LEFT join (select top 10 percent FirstName as LPFName, LastName as LPLName, SalesYTD as LPSalesYTD  
from cte Order by SalesYTD DESC) LP 
on cte.FirstName = LP.LPFName  
and cte.LastName = LP.LPLName 
order by cte.[Row Number] 

As we can see from this example, we get correctly top 10 percent using NTILE(10) and
TOP 10 percent (LEFT JOIN) based on the SalesYTD in the ascending order, but the results from the bottom 10 percent (top 10 percent based on SalesYTD in the descending order) don’t match.
For the small result sets we can not really get a correct match using NTILE() as indicated in the Remarks section in BOL.

As we can see in our small set of only 17 records, first 7 groups have 2 records and last 3 groups have only 1 record, that’s why our top 10 percent in descending order didn’t match.

On the big enough sets the NTILE() is working correctly. Here is an example suggested by Peter Larsson, that demonstrates it:

T-SQL
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
;WITH cteRandomData(Value)
AS (
           SELECT     ABS(CHECKSUM(NEWID())) AS Value
           FROM       master..spt_values
           WHERE      type = 'p'
                      AND number BETWEEN 1 AND 1000
), cteRanking(Value, rnk, cnt, ntil)
AS (
           SELECT     Value,
                      DENSE_RANK() OVER (ORDER BY Value),
                      COUNT(*) OVER (),
                      NTILE(10) over (order by Value) as ntil
           FROM       cteRandomData
), ctePercentile(Value, Percentile, ntil)
AS (
           SELECT     Value,
                      100 * (rnk - 1) / cnt + 1 AS Percentile,
                      ntil
           FROM       cteRanking
)
SELECT     *
FROM       ctePercentile
--WHERE      Percentile = 10
;WITH cteRandomData(Value)
AS (
           SELECT     ABS(CHECKSUM(NEWID())) AS Value
           FROM       master..spt_values
           WHERE      type = 'p'
                      AND number BETWEEN 1 AND 1000
), cteRanking(Value, rnk, cnt, ntil)
AS (
           SELECT     Value,
                      DENSE_RANK() OVER (ORDER BY Value),
                      COUNT(*) OVER (),
                      NTILE(10) over (order by Value) as ntil
           FROM       cteRandomData
), ctePercentile(Value, Percentile, ntil)
AS (
           SELECT     Value,
                      100 * (rnk - 1) / cnt + 1 AS Percentile,
                      ntil
           FROM       cteRanking
)
SELECT     *
FROM       ctePercentile
--WHERE      Percentile = 10