SQL Server 2008 : Explaining Advanced Query Techniques - Creating CTEs

1/21/2011 2:38:04 PM

The syntax for using CTEs is broken down into two parts: the declaration of the CTE itself and the statement that references the CTE that was just declared:

-- CTE DECLARATION
WITH CTE_name [ ( column_name [,...n] ) ]

AS
( CTE_query_definition )
-- CTE USE
SELECT <column_list>
FROM CTE_NAME

The entire block of syntax in this example represents a single statement, and some syntax rules particular to CTEs should be pointed out:

When the CTE is not the first statement in a batch, the statement preceding the CTE must have been terminated with a semicolon (“;”). See the sidebar titled “The Transact-SQL Statement Terminator” for more information.
The list of “column_name” values in the CTE declaration is optional unless there are expressions in the CTE_query_definition that do not have explicit column names provided.
The CTE definition is valid only for a single statement. If we want to use the same CTE for multiple statements, we must restate the CTE declaration for each of the multiple queries.

New & Noteworthy...: The Transact-SQL Statement Terminator

The semicolon (“;”) Transact-SQL statement terminator has historically been an optional syntax element in SQL Server. However as the language becomes more involved, the parser needs some help in properly interpreting our code. SQL Server 2008 Documentation states that the semicolon will be required in a future version. It would be in our best interest then to begin the habit of terminating all SQL statements with a semicolon now.

CTEs are an example of where the semicolon is required on the statement that precedes the CTE. A CTE’s definition begins with a WITH clause. As you will see later in this chapter, the SELECT statement can also have a WITH clause for a completely different purpose when using table locking and index hints. To help the SQL parser understand your intended use of WITH when creating CTEs, you must make sure that any previous statements in the same batch were properly terminated with a semicolon. It is for this reason that you may see some developers precede the CTE’s WITH clause using a semicolon like this, “;WITH ”. It’s important to understand that the semicolon is not a part of the CTE’s declaration but rather the explicit termination of the statement that precedes the CTE’s declaration in the batch.

If you fail to properly terminate the previous statement before declaring a CTE, SQL Server will let you know with the following error message:

Msg 319, Level 15, State 1, Line 4
Incorrect syntax near the keyword 'with'. If this statement is a
common table expression, an xmlnamespaces clause or a change tracking
context clause, the previous statement must be terminated with a
semicolon.

Let’s look at a sample CTE. The following CTE and its referencing query answer the question, “How many products have sold a total quantity of greater than 5000 items across all orders?”

USE AdventureWorks2008;
GO
--CTE DECLARATION
WITH ProductsOver5000 AS
(
   SELECT ProductID, SUM(OrderQty) AS TotalQty
   FROM Sales.SalesOrderDetail
   GROUP BY ProductID
   HAVING SUM(OrderQty) > 5000
)
--CTE USE
SELECT COUNT(*) AS NumProductsOver5000
FROM ProductsOver5000;

The query that references the CTE is clean and easy to read. It simply references the ProductsOver5000 CTE declared above as its source. It is this clarity in the final query that often makes CTEs much easier to understand than a functionally equivalent statement written using subqueries. Let’s try the same query using a subquery rather than a CTE:

USE AdventureWorks2008;
GO
SELECT COUNT(*) AS NumProductsOver5000
FROM
(
  SELECT ProductID, SUM(OrderQty) AS TotalQty
  FROM Sales.SalesOrderDetail
  GROUP BY ProductID
  HAVING SUM(OrderQty) > 5000
) AS ProductsOver5000;

Many people find the subquery version to be more difficult to understand because the “inner” query is stated in the middle of the “outer” query. It can make it difficult to figure out where one query starts and the other stops.

A single statement can have multiple CTEs declared. If that is the case, all CTE declarations must occur before the statement that references them. When defining multiple CTEs in a single statement, the WITH keyword only appears once, and each CTE declaration is separated by a comma (“,”). A benefit of multiple CTEs is that one CTE can build on the results of a previous CTE, much in the same way that nested subqueries are used. Here is an example of a query that uses multiple CTEs to answer the question, “What products were sold on the last day of business, and what was the total quantity sold for each product?”:

USE AdventureWorks2008;
GO
WITH
-- First CTE: LastDayOrderIDs
-- Get the SalesOrderID values for all orders placed on the last day of
-- business (MAX(OrderDate))
LastDayOrderIDs AS
(
   SELECT SalesOrderID FROM Sales.SalesOrderHeader
   WHERE OrderDate = (SELECT MAX(OrderDate) FROM Sales.SalesOrderHeader)
),
-- Second CTE: LastDayProductQuantities
-- Now find the distinct ProductIDs and their total quantities for all
-- of the SalesOrderID values retrieve in the previous CTE
LastDayProductQuantities AS
(
   SELECT ProductID, SUM(OrderQty) AS LastDayQuantity
   FROM Sales.SalesOrderDetail
   JOIN LastDayOrderIDs
   ON Sales.SalesOrderDetail.SalesOrderID = LastDayOrderIDs.SalesOrderID
   GROUP BY ProductID
),
-- Third CTE: LastDayProductDetails
-- Take the ProductIDs and last day quantities from the previous
-- CTE and join it to the Product table to retrieve the product names
LastDayProductDetails AS
(
   SELECT
      Product.ProductID,
      Product.Name,
      LastDayQuantity
   FROM Production.Product
   JOIN LastDayProductQuantities
   ON Product.ProductID = LastDayProductQuantities.ProductID
)
-- Finally return the results of the last CTE
SELECT * FROM LastDayProductDetails
ORDER BY LastDayQuantity DESC;

The preceding query may not be the most efficient syntax for retrieving the desired result, but it does give us a way to break what may seem like a complex query down into smaller bite-size pieces (CTEs). It also allows us to do it in a top-down style notation rather than the sometimes more confusing inside-out subquery syntax. It is just another way to approach solving a problem and may allow you think around obstacles you encounter in other methods.

Tip

When taking a test you are provided with a note sheet and a pen. A helpful tip might be to write down key things you want to remember onto the note sheet before you start the test. This can help you focus your thoughts before the test as well as provide you with reminders once the test starts.

Whether we choose to write our statements using CTEs or subqueries is often just a matter of personal preference. If it is possible for us to write a statement using either a CTE or a subquery, we will probably find little difference between the CTE version of the statement when compared to the subquery version of the statement as far as SQL Server’s optimizer is concerned. Assuming they both imply the same intentional result, we will likely get the same execution plan returned for both styles of syntax. Of course, that is a generalization, and when we do have a choice we should always test to see if one version of the statement has a better execution plan than the other. The CTE becomes a requirement and not an option when we want to reference the CTE multiple times in the same query (as in a self-join situation), when we want the multiple CTEs to build on the single “base” CTE, and when we want the CTE’s query to be recursive.

Creating Recursive CTEs

Recursive queries are a powerful feature of CTEs. With recursive CTEs we have the ability to perform in a single set-based statement what in previous versions has typically required the implementation of procedures or functions. The recursive nature of CTEs comes from their ability to be self-referencing; something neither subqueries nor views can do. Recursive queries, though powerful, can sometimes be difficult to visualize, and if we aren’t careful can generate “runaway” code. SQL Server prevents this from happening by default, but we need to be aware of what recursion means and learn how to control it. We will cover controlling recursion when we discuss the MAXRECURSION option at the end of this section. If you are new to the concept of recursion, read the sidebar titled “Recursion” for more information.

Head of the Class...: Recursion

Recursion is a useful programming practice for writing code that calls itself. Recursion provides a powerful mechanism for processing trees and hierarchies, building running totals, calculating factorials, and so on.

Recursion is a common tool used in many languages. SQL Server has supported recursion in stored procedures and triggers in previous versions, but CTEs now give us the ability to create recursive queries. As explained in this chapter, CTEs are named queries that can be referenced from a SQL statement. The name of a CTE can in fact be referenced from within its own query definition. By following some specific syntax rules, SQL Server can process the self-referencing CTE query definitions and traverse the parent/child relationships that are so common in relational databases.

A common example of a recursive CTE in the SQL world is one that uses the relationship between an employee and a manager. In this scenario a manager has employees, but those employees may also be managers of other employees, and so on and so forth. With a recursive query, we can retrieve the top-level (Level 0) manager in the “Anchor query” and then in the recursive operation find all the employees that report to that manager (Level 1), iterate the recursive query again to find all the employees for that report to the Level 1 employees (Level 2), and again, and again.

The structure of a recursive CTE is similar to that of a regular CTE with the exception of the CTE’s query definition. Here is the basic layout of a recursive CTE query:

-- CTE DECLARATION
WITH CTE_name [ ( column_name [,...n] ) ]
AS
(
CTE_Anchor_member_query_definition
UNION ALL
CTE_Recursive_member_query_definition
)
-- CTE USE
SELECT <column_list>
FROM CTE_NAME

The preceding syntax shows that a recursive CTE’s query definition has two primary components: the “anchor member” and the “recursive member.” The anchor member is the query that returns rows that represent the top-level items in a hierarchy. For example, when using CTEs to recursively iterate through the manager/employee relationships in an organization, the anchor member query will return only the top-level employee (or employees if there are multiple people at the very top of the organizational chart). Anchor member queries specifically may not reference the CTE. Referencing the CTE is what the recursive member does.

The “recursive member” is the query that returns the details for the lower level items and maps to the “parents” of those records by joining back to the CTE itself. The recursive member is the query that references the CTE by its name and causes SQL server to perform the recursion.

The anchor member and the recursive member are separated by a UNION ALL operator. This continuously attaches the rows from each recursion to the final result set.

If you have read up on recursive CTE queries before, you have likely run across the standard manager / employee recursive query using the AdventureWorks. HumanResources.Employee table from SQL Server 2005. SQL Server 2008 has significantly changed the structure of the HumanResources.Employee table and how the hierarchy itself is represented. The most notable differences are the new BusinessEntityID and OrganizationNode columns. The OrganizationNode column uses the new HierarchyID data type. A discussion of it is beyond the scope of this chapter, but you can refer to the SQL Server 2008 documentation for details on it. Without further ado, here is a sample recursive query:

USE AdventureWorks2008;
GO
WITH OrgChart --Notice the name of the CTE is "OrgChart"
(OrganizationNode,LoginID, ManagerLoginID, Level)
AS
(
    --Anchor Member query:
    SELECT OrganizationNode, LoginID, CAST(NULL AS nvarchar(256)),
       OrganizationNode.GetLevel()
    FROM HumanResources.Employee
    WHERE OrganizationNode = HierarchyID::GetRoot()
    UNION ALL
--Recursive Member Query:
SELECT E.OrganizationNode, E.LoginID, M.LoginID,
E.OrganizationNode.GetLevel()
FROM HumanResources.Employee AS E
JOIN OrgChart AS M

ON E.OrganizationNode.GetAncestor(1) = M.OrganizationNode
)
SELECT * FROM OrgChart;

Let’s dissect this query a little bit. First, we see that the name of the CTE is OrgChart. Next is the explicit list of column names that the query will return. In this example it is useful to state the column aliases here rather than in the query definitions inside the CTE’s declaration. It just helps clean things up a little. Also in this query, only the LoginIDs of the employees and managers are being retrieved. This keeps the query simple and clean, yet still gives us a sense of “who” the managers and employees are. Remember, this is just an example.

The anchor member query returns details for employees who are at the top level of the organization. In this case we are determining that an employee is at the top of the organization by testing to see if the employee’s OrganizationNode column value is the same as the root value of the HierarchyID data type (HierarchyID::GetRoot()). The last column returns the “Level” of the employee by calling the GetLevel() method of the OrganizationNode HierarchyID value (OrganizationNode.GetLevel()), When SQL Server processes the anchor member query, a single row is returned:

OrganizationNode    LoginID                   ManagerLoginID             Level
----------------    ------------------------   -----------------------    -----
0x                   adventure-works\ken0      NULL                       0

Note that in the row returned from the anchor member query definition, the ManagerLoginID value is NULL. This is because members at the top level of the hierarchy by definition do not have managers. Therefore, any manager data for those employees is empty, or NULL.

SQL server then moves to process the recursive member. The recursive member query joins back to the CTE by its name. In our example, the CTE now represents the row that was returned from the anchor member. SQL Server finds all employees that have an OrganizationNode value with the ancestor 1 level up (OrganizationNode.GetAncestor(1)) having the same value as the manager’s OrganziationNode value. The first pass through the recursive query then returns all employees who report to “adventure-works\ken0”:

OrganizationNode    LoginID                    ManagerLoginID          Level
----------------    ----------------------     --------------------    -----
0x58                adventure-works\terri0     adventure-works\ken0    1
0x68                adventure-works\david0     adventure-works\ken0    1
0x78                adventure-works\james1     adventure-works\ken0    1

0x84                adventure-works\laura1     adventure-works\ken0    1
0x8C                adventure-works\jean0      adventure-works\ken0    1
0x94                adventure-works\brian3     adventure-works\ken0    1

SQL Server then continues to iterate the recursive member query for each row returned by its previous iteration. For example, the next iteration would join any employees that report to “adventure-works\brian3” to Brian’s record and show his direct reports, giving us the following additional rows:

Code View: Scroll / Show All

OrganizationNode    LoginID                   ManagerLoginID           Level
----------------    -----------------------   ---------------------    -----
...                 ...                       ...                      ...
0x9560              adventure-works\stephen0  adventure-works\brian3   2
0x95A0              adventure-works\syed0     adventure-works\brian3   2
0x95E0              adventure-works\amy0      adventure-works\brian3   2
...                 ...                       ...                      ...

SQL Server then continues recursing into the CTE for each row returned to find that row’s employees until all rows have been exhausted or the maximum depth of recursion has been reached. By default, SQL Server supports 100 levels of recursion along a single “branch” in a hierarchy. If we have hierarchies with branches that are deeper than 100 levels, we will need to override the default using the MAXRECURSION query option. Query options will be explained in more detail later in this chapter. We can tell if we have exceeded the MAXRECURSION by the system returning the following error message:

Msg 530, Level 16, State 1, Line 1
The statement terminated. The maximum recursion 3 has been exhausted
before statement completion.

Warning

Don’t forget about MAXRECURSION. Make sure to pay attention to queries that could repeat indefinitely during the exam. Recursive queries that go deeper than 100 levels along any branch will cause errors unless the MAXRECURSION option is used.

In Exercise 1 you will create a recursive CTE and test the effect of the MAXRECURSION option.

Exercise 1. Creating A Recursive CTE

Open a new query in SQL Server Management Studio
Enter the following code into the window:
USE AdventureWorks2008;
GO
WIH OrgChart (OrganizationNode,LoginID,MLoginID) AS
(SELECT OrganizationNode, LoginID, CAST(NULL AS nvarchar(256))
FROM HumanResources.Employee
WHERE OrganizationNode = HierarchyID::GetRoot ()
UNION ALL
SELECT E.OrganizationNode, E.LoginID, M.LoginID
FROM HumanResources.Employee AS E
JOIN OrgChart AS M
ON E.OrganizationNode.GetAncestor(1) = M.OrganizationNode)
SELECT * FROM OrgChart
OPTION(MAXRECURSION 4);
Run the query and view the results.
Try the same query that is shown in step 2, but change the MAXRECURSION value from 4 to 3.

Other

SQL Server 2008 : Multiple Sources with SQL Server Integration Services

SQL Server 2008 : Working with Multiple-Source Queries - OpenQuery, OpenRowSet, and OpenDataSource Explained

SQL Server 2008 : Working with Multiple-Source Queries - Using Four-Part Database Names & The DTC Explained

SQL Server 2008 : Working with Multiple-Source Queries - Using Linked Servers

OData with SQL Azure - Enabling OData on an Azure Database

OData with SQL Azure - OData Overview

SQL Server 2008 : Performing Nonlogged DML Operations

SQL Server 2008 : Using the OUTPUT Clause with the MERGE Statement

SQL Server 2008 : Returning Data from DML Operations Using the OUTPUT Clause

SQL Server 2008: Working with System Databases