We looked at the basics of SQL Server 2005's Common Table Expression, i.e. a CTE, in the previous article in this series. In this article, we will look at using a CTE to implement a hierarchical query. This article assumes that you are familiar with the basic syntax and operation of a CTE. You can look at the previous article in this series, SQL Server 2005 Recursive Functions to refresh yourself on this topic.
A hierarchical query is useful whenever you must ask questions of tree-structured data, especially when the tree is not balanced. The figure below shows a simple example of this kind of structured data, an organization chart.
There are a number of questions you might ask of this structure; for example, "What is the total number of people reporting to Bob?" (5)
A database can represent an organization chart as a table with a self-referencing index. The figure below shows an "Employees" table that does this.
The Employees table has a primary key for the employee Id and the Boss column has a foreign key that references the employee Id of the Employees boss. There is a convention that the Id and Boss columns of the big boss are the same.
In order to see the basics of a hierarchical query let's start by looking at one that answers the question we posed; "How many people report to Bob?" The format for a CTE that does a hierarchical query is similar to that used for a recursive function described in the previous article; it has two SELECT statements separated by a UNION ALL phrase. The figure below shows a CTE used to find the number of people who report to Bob.
A hierarchical query also has the same three parts as a recursive function does; initialization, recursive execution, and test for completion. The query starts by executing the SELECT statement before the UNION ALL phrase just once. This is the initialization of the hierarchical query and produces a number of rows; the ones that contain employees whose Boss column equals two.
The SELECT statement after the UNION ALL phrase is recursively executed until it produces no rows. In a CTE a SELECT statement that JOINs the CTE always JOINs with the rows produced by the previously executed SELECT statement. The rows resulting from each SELECT statement executed in the CTE accumulate progressively in the Reports CTE and are available upon the completion of the CTE to the SELECT statement that follows it. Therefore, each recursive execution adds to the Reports CTE all the employees that have a Boss column that is equal to an ID column in the rows produced by the previous SELECT statement.
The test for completion is in the ON clause of the recursive execution; when none of the Boss columns in the Employees table are found in the ID column of the Reports CTE, the recursion stops.
Once the recursion has stopped, the Reports CTE contains all the employees. The SELECT statement that follows the CTE counts the number of rows in the Reports CTE, which is the number of people who report to Bob. The data from the CTE can be inserted into an SQL file.
The figure below shows the flow of execution of SELECT statements in this CTE and the progressive construction of the Reports CTE.
The initialization SELECT produces the ID's 10 and 14 because those rows in the Employees table have a Boss column value of 2. These rows are added to the Reports CTE.
The first recursive execution, SELECT, JOINs the Employees table with the rows produced by the SELECT statement. It produces IDs 3,7, and 5 and adds them to the Reports CTE because previous SELECT statement produced Id values that correspond to the Boss values in the rows with IDs 3, 7, and 5.
The second recursive execution SELECT statement produces no rows because none of the rows produced by the first recursive SELECT statement has an ID value that correspond to the Boss value columns in any of the rows in the Employees tables. In this case, the test for completion succeeds and recursive execution stops.
The SELECT statement following the CTE counts the number of rows produced by all the SELECT statements executed in the CTE itself. The reference to "Reports" in the statement following the CTE acts like a temporary table that holds all the rows produced by the CTE and has the IDs of the five people who report to Bob in it.
The CTE we have been examining has a flaw in it, and it is potential flaw in any CTE -- it will fail if we ask it to find the number of people who report to the big boss, Carol. The figure below shows the result of running the CTE where the big boss's ID is used to initialize it.
Because Carol is her own boss, the recursive SELECT statement will always find a boss for Carol and therefore will run forever. To prevent this from overwhelming SQL Server 2005, it has a default recursion limit of 100. In some cases you may want to use a different limit. To do this you add an OPTION clause in the SELECT statement that follows the CTE to change the MAXRECURSION. This can be set to any number from 0 to 32,767. Note that 0 means there is no limit to recursion. The figure below shows the usage of MAXRECURSION.
Notice that this does not really fix our problem, it just causes the failure to happen a bit sooner. A typical problem with a recursive query is that the test for completion is not correct, and that is the case here. Because of the convention we have chosen for marking the big boss, we either must specifically not select the big boss when looking for reports or use a different convention for the big boss. The figure below shows the CTE corrected to take into account the big boss convention.
Changing the convention for marking the big boss to making the Boss column NULL also fixes this problem. The important point here is that a recursive query has the potential to recurse forever, and only be careful design of your tables and query can you prevent this from happening. There is no simple test that can guarantee infinite recursion cannot occur.
There still is one fly in this ointment; the hardcoded value in the initialization SELECT statement. It is useful to wrap a hierarchical query in a function so that it can be initialized parametrically. The figure below shows the CTE we have been working on wrapped in table valued function.
Now that the hierarchical query is wrapped in a function it can be used just as any other function. Note that the WITH keyword must be preceded by a ";" if it is not the first statement in a batch. Now we can parametrical set the initialization and get the number of people reporting to anyone in the database. For example:
SELECT dbo.Reports((SELECT Id FROM Employees WHERE Name = 'Bob'))
returns the number of employees that report to Bob. We are using a scalar SELECT expression to get the Id for "Bob" and pass it into the Reports function. Note that this example assumes there is only on Bob in the organization. However, you can still use the Reports function to find out how many people report to each person in the organization. The following query:
SELECT Name, dbo.Reports(Id) from Employees
Ellen 0 Frank 0 Bob 5 Dave 3 Ira 0 Alice 7 Carol 0 Helen 0
This query lists the name of each person in the organization and the number of people that report to them. If there were two Bob's in the organization it would list the number of reports to each separately.
In summary a CTE can be used to implement a hierarchical query. The SELECT statement before the UNION ALL is executed first and just once. The SELECT statement after the UNION ALL is executed recursively until it produces no rows. The recursive query can reference the rows produced by the previous query. The SELECT statement that follows the CTE can reference all of the rows produced by the SELECT statements in the CTE. Check your recursive query closely to make sure that it will not recurse infinitely.
About the author
Dan Sullivan started his professional life at Digital Equipment Corporation working on the development of the PDP-11's. Later, he was the Director of Engineering at Data Translation Incorporated, which developed software and hardware data acquisition products for PDP-11's and introduced their first products for the IBM PC. Sullivan was also the Chief Engineer at Delta Lab Research and introduced their first microprocessor-based sound equipment. For the past 15 years, Sullivan has been an independent software consultant, initially for DOS and now for Microsoft Windows NT. His consulting projects spanned the range from GUI-based products for the electronics manufacturing industry to software support for high data rate MPEG2 processing systems. Almost all the projects he has worked on wove together a database, GUI, engineering data and performance, and required mentoring the customer's programming staff. He is the president of Danal Technology Inc., his consulting company.