Exploring the T-SQL Enhancements in SQL Server 2005 : Ranking Functions

10/12/2010 9:39:26 AM

Databases hold data. Users sometimes want to perform simple calculations or algorithms on that data to rank the results in a specific order—like gold, silver, and bronze medals in the Olympics or the top 10 customers by region. SQL Server 2005 introduces functionality for using ranking expressions with your result set. You can select a number of ranking algorithms, which are then applied to a column that you specify. This comes in handy in .NET Framework applications for paging and sorting in a grid, as well as in many other scenarios.

ROW_NUMBER()

The most basic ranking function is ROW_NUMBER(). It returns a column as an expression that contains the row’s number in the result set. This number is used only in the context of the result set; if the result changes, the ROW_NUMBER() changes. The ROW_NUMBER() expression takes an ORDER BY statement with the column you want to use for the row count and the new OVER operator, which links the ORDER BY to the specific ranking function you are using. The ORDER BYOVER clause replaces an ORDER BY at the end of the SQL statement. in the

The following simple example gives a row number to each row in the result set, ordering by SalesOrderID:

Listing 1. Row number ranking

USE AdventureWorks
GO
SELECT SalesOrderID, CustomerID,
   ROW_NUMBER() OVER (ORDER BY SalesOrderID) AS RunningCount
FROM Sales.SalesOrderHeader
WHERE SalesOrderID>10000

Results are shown here:

SalesOrderID       CustomerID      RunningCount
-------------------------------------
43659              676             1
43660              117             2
43661              442             3
43662              227             4
43663              510             5
43664              397             6
43665              146             7
43666              511             8
43667              646             9
[more rows]

ORDER BY Options

The ranking functions order your result set by the fields specified in the ORDER BY statement contained in the OVER clause. Alternatively, you can include an additional ORDER BY statement in your result set; this optional statement is distinct from the ORDER BY clause in the OVERROW_NUMBER() function’s results are displayed in the order they are determined by in the additional ORDER BY statement, not by the ORDER BY statement contained within the OVER clause. The results can, therefore, be confusing, as in this example: expression. SQL Server 2005 allows this, but if you choose this option, the

--Row_Number using a unique value, different order by
SELECT SalesOrderID, CustomerID,
   ROW_NUMBER() OVER (ORDER BY SalesOrderID) AS RunningCount
FROM Sales.SalesOrderHeader
WHERE SalesOrderID>10000
ORDER BY CustomerID --Different ORDER BY than in Row_Number

Here are the results:

SalesOrderID    CustomerID    RunningCount
							------------------------------------------
43860           1             202
44501           1             843
45283           1             1625
46042           1             2384
46976           2             3318
47997           2             4339
49054           2             5396
50216           2             6558
51728           2             8070
57044           2             13386
63198           2             19540
69488           2             25830
65310           3             21652
71889           3             28231
53616           3             9958
[more rows]

As you can see, if you expect the results to be sorted by the OVER clause’s ORDER BY statement, you’d expect results ranked by SalesOrderID, when in fact they’re ordered by CustomerID.

If you choose the ROW_NUMBER() function to run against a non-unique column that contains multiple copies of the same value (also known as “ties,” such as the same amount of items sold and the same time in a race), ROW_NUMBER( ) breaks the tie and still produces a running count so that no rows have the same number. For example, CustomerID can repeat in this example, which will generate several ties; SQL Server simply increases the running count for each row, regardless of how many ties exist:

Listing 2. Row number ranking with ties

SELECT SalesOrderID, CustomerID,
   ROW_NUMBER() OVER (ORDER BY CustomerID) AS RunningCount
FROM Sales.SalesOrderHeader
WHERE SalesOrderID>10000

The results are shown here:

SalesOrderID       CustomerID       RunningCount
							------------------------------------------------
43860              1                1
44501              1                2
45283              1                3
46042              1                4
46976              2                5
47997              2                6
49054              2                7
50216              2                8
51728              2                9
57044              2                10
63198              2                11
69488              2                12
44124              3                13
[more rows]

Grouping and Filtering with a Row_Number

When you want to include a GROUP BY function in your query, ranking functions do not work. The easy way around this limitation is to create your GROUP BY in a CTE and then perform your ranking on the results, as shown here:

Listing 3. Grouping by row number

USE AdventureWorks
GO
--rank by totaldue, summed
--need a CTE to do the sum
--so this example will have a
--customerID summed with all of
--their orders
WITH CustomerSum
AS
(
SELECT CustomerID, SUM(totaldue) AS totalamt
FROM Sales.SalesOrderHeader
GROUP BY CustomerID
)
--this appends a row_number to the end of the resultset
SELECT *,
   ROW_NUMBER() OVER (ORDER BY totalamt DESC) AS RowNumber
FROM CustomerSum

Here are the results:

CustomerID      totalamt        RowNumber
								-----------------------------------------
678             1179857.4657    1
697             1179475.8399    2
170             1134747.4413    3
328             1084439.0265    4
514             1074154.3035    5
155             1045197.0498    6
72              1005539.7181    7
[more rows]

To filter by a ROW_NUMBER, you have to put the ROW_NUMBER function in a CTE.

Listing 4. Filtering by row number

USE AdventureWorks
GO
--use a common table expression if you want
--to filter by one of the rows that contain a
--ranking function since ranking functions
--are not allowed in where or having clauses
WITH NumberRows
AS
(
SELECT SalesOrderID, CustomerID,
   ROW_NUMBER() OVER (ORDER BY SalesOrderID) AS RowNumber
FROM Sales.SalesOrderHeader
)

SELECT * FROM NumberRows
WHERE RowNumber BETWEEN 100 AND 200

Here are the results:

SalesOrderID       CustomerID      RowNumber
--------------------------------------------
43759              13257           100
43760              16352           101
43761              16493           102
43762              27578           103
[more rows]

RANK()

The ranking function you will probably use the most is RANK(). RANK() ranks the data in the ORDER BY clause in the order you specify. RANK() is syntactically exactly like ROW_NUMBER() but with true ranking results. It works just like in the Olympics, when two people tie for the gold medal—the next rank is bronze. Similarly, with the RANK() function, if four rows are tied with the value ‘1’, the next row value for the rank column will be ‘5’. Consider this code:

Listing 5. The RANK() function

USE AdventureWorks
GO
SELECT SalesOrderID, CustomerID,
   RANK() OVER (ORDER BY CustomerID) AS RunningCount
FROM Sales.SalesOrderHeader
WHERE SalesOrderID > 10000

Here are the results:

SalesOrderID      CustomerID      RunningCount
							----------------------------------------------
43860             1               1
44501             1               1
45283             1               1
46042             1               1
46976             2               5
47997             2               5
49054             2               5
50216             2               5
51728             2               5
57044             2               5
63198             2               5
69488             2               5
44124             3               13
[more rows]

Just like with the other ranking functions, RANK() needs the aid of a CTE to work with aggregates. Consider this query that ranks the customers from highest to lowest by total sales. We have to use a CTE to perform the aggregate first and then rank over the newly created aggregate expression:

Listing 6. Ranked aggregates

USE AdventureWorks
GO
--rank by totaldue, summed
--need a CTE to do a sum
--so this example will have a
--customerID summed with all of
--their orders
WITH CustomerSum
AS
(
SELECT CustomerID, SUM(totaldue) AS totalamt
FROM Sales.SalesOrderHeader
GROUP BY CustomerID
)
SELECT *,
   RANK() OVER (ORDER BY totalamt DESC) AS Rank
FROM CustomerSum

The results are shown here. Notice that customer 697 is in first place:

CustomerID     totalamt         Rank
							-------------------------------------
697            1179475.8399     1
678            1164761.0209     2
170            1134747.4413     3
328            1084439.0265     4
514            1074154.3035     5
[more rows]

It is important to remember that the ranking functions provided by SQL Server 2005 are valid only for the scope of the running query. If the underlying data changes and then you run the same query again, you will get different results. For example, let’s modify a record from customer 678:

Listing 7. RANK() results change with underlying data changes

USE AdventureWorks
GO
UPDATE Sales.SalesOrderDetail
SET OrderQty = 13 --3 is the orig amt
WHERE SalesOrderID = 44086 AND SalesOrderDetailID = 1542

The results are shown here. Notice that customer 678 is now the top customer:

CustomerID      totalamt         Rank
							-------------------------------------
678             1185160.9609     1
697             1179475.8399     2
170             1134747.4413     3
328             1084439.0265     4
514             1074154.3035     5

DENSE_RANK() and NTILE(n)

The last two ranking functions we will cover are DENSE_RANK() and NTILE(n). DENSE_RANK() works exactly like RANK(), except it increments only on distinct rank changes.

Listing 8. Ranking with DENSE_RANK()

USE AdventureWorks
GO
SELECT SalesOrderID, CustomerID,
   DENSE_RANK() OVER (ORDER BY CustomerID) AS RunningCount
FROM Sales.SalesOrderHeader
WHERE SalesOrderID > 10000

The results are shown here:

SalesOrderID     CustomerID      RunningCount
							---------------------------------------------
43860            1               1
44501            1               1
45283            1               1
46042            1               1
46976            2               2
47997            2               2
49054            2               2
50216            2               2
51728            2               2
57044            2               2
63198            2               2
69488            2               2
44124            3               3
[more rows]

The following example shows the difference between RANK() and DENSE_RANK(). We will round the customers’ sales to the nearest hundred (because managers always like to look at whole numbers in their reports!) and look at the difference when we run into a tie. The code is here:

Listing 9. RANK() versus DENSE_RANK()

Code View: Scroll / Show All

USE AdventureWorks
GO
WITH CustomerSum
AS
(
SELECT CustomerID, ROUND(CONVERT(int, SUM(totaldue))/100,8) *100 as totalamt
FROM Sales.SalesOrderHeader
GROUP BY CustomerID
)
SELECT *,
   RANK() OVER (ORDER BY totalamt DESC) AS Rank,
   DENSE_RANK() OVER (ORDER BY totalamt DESC) AS DenseRank
FROM CustomerSum

And here are the results:

CustomerID       totalamt       Rank       DenseRank
							----------------------------------------------------
678              1185100        1          1
697              1179400        2          2
170              1134700        3          3
328              1084400        4          4
[more rows]
87               213300         170        170
667              210600         171        171
196              207700         172        172
451              206100         173        173
672              206100         173        173
272              05200          175        174
687              205200         175        174
163              204000         177        175
102              203900         178        176
[more rows]

Notice that customers 451 and 672 are tied, with the same total sales amount. They are ranked 173 by both the RANK() and DENSE_RANK() functions. What happens next is what’s important. Customers 27 and 687 are tied for the next position, and they are both assigned 175 by RANK() and 174 by DENSE_RANK(). Customer 163 is the next non-tie, and it is assigned 177 by RANK() and 175 by DENSE_RANK().

NTILE(n) divides the returned rows into approximately evenly sized groups, the number of which you specify as a parameter to the function. It assigns each member of a group the same number in the result set. A perfect example of this is the percentile ranking in a college examination or a road race.

Listing 10. Ranking with NTILE()

SELECT SalesOrderID, CustomerID,
   NTILE(10000) OVER (ORDER BY CustomerID) AS RunningCount
FROM Sales.SalesOrderHeader
WHERE SalesOrderID > 10000

The results are shown here:

SalesOrderID      CustomerID     RunningCount
							---------------------------------------------
43860             1              1
44501             1              1
45283             1              1
46042             1              1
46976             2              2
47997             2              2
49054             2              2
50216             2              2
51728             2              3
57044             2              3
63198             2              3
69488             2              3
44124             3              4
[more rows]

Using All the Ranking Functions Together

So far we have looked at the ranking functions in isolation. The ranking functions are just regular SQL Server expressions, so you can have as many of them as you want in a single SELECT statement. We’ll look at one last example that brings these all together into one SQL statement and shows the differences between the four ranking functions.

Listing 11. Contrasting SQL Server 2005 ranking functions

--Ranking All
USE AdventureWorks
GO
SELECT SalesOrderID AS OrderID, CustomerID,
   ROW_NUMBER() OVER(ORDER BY CustomerID) AS RowNum,
   RANK() OVER (ORDER BY CustomerID) AS Rank,
   DENSE_RANK() OVER (ORDER BY CustomerID) as DRank,
   NTILE(10000) OVER (ORDER BY CustomerID) as NTile
FROM Sales.SalesOrderHeader
WHERE SalesOrderID>10000
ORDER BY CustomerID

The results are shown here:

OrderID      CustomerID      RowNum      Rank      Drank      NTile
							--------------------------------------------------------------------
43860        1               1           1         1          1
44501        1               2           1         1          1
45283        1               3           1         1          1
46042        1               4           1         1          1
46976        2               5           5         2          2
47997        2               6           5         2          2
49054        2               7           5         2          2
50216        2               8           5         2          2
51728        2               9           5         2          3
57044        2               10          5         2          3
63198        2               11          5         2          3
69488        2               12          5         2          3
44124        3               13          13        3          4
44791        3               14          13        3          4
[more rows]

Ranking over Groups: PARTITION BY

The ranking functions can also be combined with windowing functions. A windowing function divides a result set into equal partitions based on the values of your PARTITION BY statement in conjunction with the OVER clause in your ranking function. This is like applying a GROUP BY to your ranking function—you get a separate ranking for each partition. The following example uses ROW_NUMBER with PARTITION BY to count the number of orders by order date by salesperson. We do this by using PARTITION BY SalesPersonID OVER OrderDate. You can do this with any of the four ranking functions.

Listing 12. Ranking over groups with PARTITION BY

USE AdventureWorks
GO
SELECT SalesOrderID, SalesPersonID, OrderDate,
ROW_NUMBER() OVER (PARTITION BY SalesPersonID ORDER BY OrderDate) AS OrderRank
FROM Sales.SalesOrderHeader
WHERE SalesPersonID is not null

The results are shown here. You might find that the order of your rows varies slightly from the results shown.

SalesOrderID       SalesPersonID       OrderDate                    OrderRank
							-----------------------------------------------------------------------------
[more rows]
43659              279                 2001-07-01 00:00:00.000      1
43660              279                 2001-07-01 00:00:00.000      2
43681              279                 2001-07-01 00:00:00.000      3
43684              279                 2001-07-01 00:00:00.000      4
43685              279                 2001-07-01 00:00:00.000      5
43694              279                 2001-07-01 00:00:00.000      6
43695              279                 2001-07-01 00:00:00.000      7
43696              279                 2001-07-01 00:00:00.000      8
43845              279                 2001-08-01 00:00:00.000      9
43861              279                 2001-08-01 00:00:00.000      10
[more rows]
48079              287                 2002-11-01 00:00:00.000      1
48064              287                 2002-11-01 00:00:00.000      2
48057              287                 2002-11-01 00:00:00.000      3
47998              287                 2002-11-01 00:00:00.000      4
48001              287                 2002-11-01 00:00:00.000      5
48014              287                 2002-11-01 00:00:00.000      6
47982              287                 2002-11-01 00:00:00.000      7
47992              287                 2002-11-01 00:00:00.000      8
48390              287                 2002-12-01 00:00:00.000      9
48308              287                 2002-12-01 00:00:00.000      10
[more rows]

Let’s partition our ranking function by country. We’ll create a CTE to aggregate the sales by customer and by country. Then we will apply the ranking function over the TotalAmt field and the Customer ID field, partitioned by the CountryName.

Listing 13. Aggregates with PARTITION BY

USE AdventureWorks
GO
WITH CTETerritory
AS
(
SELECT cr.Name AS CountryName, CustomerID, SUM(TotalDue) AS TotalAmt
FROM Sales.SalesOrderHeader soh
   INNER JOIN Sales.SalesTerritory ter
      ON soh.TerritoryID=ter.TerritoryID
   INNER JOIN Person.CountryRegion cr
      ON cr.CountryRegionCode=ter.CountryRegionCode
GROUP BY cr.Name, CustomerID
)
SELECT *, RANK() OVER(PARTITION BY CountryName ORDER BY TotalAmt, CustomerID DESC)
As Rank
FROM CTETerritory

The results look like this:

CountryName       CustomerID       TotalAmt      Rank
							-----------------------------------------------------
Australia         29083            4.409         1
Australia         29061            4.409         2
Australia         29290            5.514         3
Australia         29287            5.514         4
Australia         28924            5.514         5
[more rows]
Canada            29267            5.514         1
Canada            29230            5.514         2
Canada            28248            5.514         3
Canada            27628            5.514         4
Canada            27414            5.514         5
[more rows]
France            24538            4.409         1
France            24535            4.409         2
France            23623            4.409         3
France            23611            4.409         4
France            20961            4.409         5
[more rows]

PARTITION BY supports other SQL Server aggregate functions, including MIN and MAX as well as your own scalar functions. You can apply your aggregate function in the same way that you apply the ranking functions, with a PARTITION BY statement. Let’s apply this technique to the current sample by adding a column to our results set using the AVG function. We will get the same results but with an additional column showing the average by country:

Listing 14. Using AVG with PARTITION BY

USE AdventureWorks
GO
WITH CTETerritory
AS
(
SELECT cr.Name AS CountryName, CustomerID, SUM(TotalDue) AS TotalAmt
FROM Sales.SalesOrderHeader soh
   INNER JOIN Sales.SalesTerritory ter
      ON soh.TerritoryID=ter.TerritoryID
   INNER JOIN Person.CountryRegion cr
      ON cr.CountryRegionCode=ter.CountryRegionCode
GROUP BY cr.Name, CustomerID
)
SELECT *, RANK() OVER(PARTITION BY CountryName ORDER BY TotalAmt, CustomerID DESC)
As Rank,
AVG(TotalAmt) OVER(PARTITION BY CountryName) AS Average
FROM CTETerritory

Here are the results:

CountryName       CustomerID      TotalAmt      Rank      Average
							-------------------------------------------------------------------
Australia         29083           4.409         1         3364.8318
Australia         29061           4.409         2         3364.8318
Australia         29290           5.514         3         3364.8318
[more rows]
Canada            29267           5.514         1         12824.756
Canada            29230           5.514         2         12824.756
Canada            28248           5.514         3         12824.756
[more rows]

Other

Exploring the T-SQL Enhancements in SQL Server 2005 : TOP Enhancements

Blind SQL Injection Exploitation : Using Time-Based Techniques

Finding and Confirming Blind SQL Injection

SQL Server 2008 : Configuration Options

SQL Server 2008 Instance Architecture

SQL Server 2008 : Using the CLR - CLR Integration

SQL Server 2008 : Using the CLR - CLR and Managed Code Explained

SQL Azure : Database Growth-Management Strategies

SQL Azure : Database-Migration Strategies

SQL Server 2005 : Implementing Service Broker