Programming Microsoft SQL Serve 2005 : An Overview of SQL CLR - CLR Aggregates

2/24/2011 4:32:47 PM

T-SQL has a number of built-in aggregates, such as SUM(), AVG(), and MAX(), but that set of built-in functions is not always sufficient. Luckily, the SQL CLR features in SQL Server 2005 allow us to implement user-defined aggregates in .NET code and use them from T-SQL. User-defined aggregates can be implemented only in SQL CLR code; they have no T-SQL equivalent. Because aggregates tend to perform computation only, they provide an excellent use case for SQL CLR code. As it turns out, they are also quite easy to build.

At first, aggregates feel and look like functions because they accept and return values. In fact, if you use an aggregate in a non-data-querying T-SQL call (for example, SELECT SUM(8)), you are in fact treating the aggregate as if it were a function. The thing to remember is that the argument passed to an aggregate is typically a column, and so each discrete value for that column, for whichever WHERE, HAVING, ORDER BY, and/or GROUP BY scope applies, gets passed into the aggregate. It is the aggregate’s job to update a variable, which eventually will be the return value, as each discrete value is passed to it.

CLR aggregates require you to apply the SqlUserDefinedAggregate attribute to them. The SqlUserDefinedAggregate attribute accepts a number of parameters, but all of them are optional except Format. In our example, we will use the value Format.Native for the Format parameter. For more advanced scenarios, you might want to study SQL Server Books Online to acquaint yourself with the other parameters this attribute accepts. Sticking with Format.Native for the Format parameter is sufficient for many scenarios.

Unlike the SqlProcedure, SqlFunction, and SqlTrigger attributes, the SqlUserDefinedAggregate attribute is required by SQL Server for your class to be eligible for use as an aggregate. Visual Studio SQL Server projects do use this attribute for deployment, and the attribute is included in the aggregate template, but it also must be used in generic Class Library project code in order for T-SQL registration of the aggregate to succeed.

Aggregate classes must have four methods: Init, Accumulate, Merge, and Terminate. The Accumulate method accepts a SQL type, the Terminate method returns one, and the Merge method accepts an object typed as the aggregate class itself.

The Accumulate method handles the processing of a discrete value into the aggregate value, and the Terminate method returns the final aggregated value after all discrete values have been processed. The Init method provides startup code, typically initializing a class-level private variable that will be used by the AccumulateMerge method is called in a specific multi-threading scenario, which we will describe later on. method. The

Just to be perfectly clear, your aggregate class will not implement an interface to supply these methods; you must create them to meet what we might term the “conventions” that are expected of SQL CLR aggregate classes (as opposed to a “contract” with which they must comply). When you develop your code in a Visual Studio 2005 SQL Server project, the Aggregate template includes stubs for these four methods as well as the proper application of the SqlUserDefinedAggregate attribute.

Creating your own aggregates is fairly straightforward, but thinking through aggregation logic can be a bit confusing at first. Imagine you want to create a special aggregate called BakersDozen that increments its accumulated value by 1 for every 12 units accumulated (much as a baker, in simpler times, would throw in a free 13th donut when you ordered 12). By using what you now know about CLR aggregates and combining that with integer division, you can implement a BakersDozen aggregate quite easily. Listing 1, the code from struct BakersDozen in aggTest.cs in the sample project, contains the entire implementation of the aggregate BakersDozen.

Listing 1. struct BakersDozen from aggTest.cs

[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(Format.Native)]
public struct BakersDozen
{
    private SqlInt32 DonutCount;

    public void Init()
    {
        DonutCount = 0;
    }

    public void Accumulate(SqlInt32 Value)
    {
        DonutCount += Value + ((Int32)Value) / 12;
    }

    public void Merge(BakersDozen Group)
    {
        DonutCount += Group.DonutCount;
    }

    public SqlInt32 Terminate()
    {
        return DonutCount;
    }

}

The code here is fairly straightforward. The private variable DonutCount is used to track the BakersDozen-adjusted sum of items ordered, adding the actual items- ordered value and incrementing the running total by the integer quotient of the ordered value divided by 12. By this logic, bonus items are added only when an individual value equals or exceeds a multiple of 12. Twelve includes a full dozen, and so would 13. Twenty-four includes two dozen, and so would 27. Two individual orders of 6 items each would not generate any bonus items because a minimum of 12 items must be ordered in a line item to qualify for a bonus.

To deploy the aggregate, use attribute-based deployment in the SQL Server project or the following command for the Class Library version:

CREATE AGGREGATE BakersDozen
        (@input int)
RETURNS int
EXTERNAL NAME Chapter03.BakersDozen

Notice that no method name is specified because the aggregate is implemented by an entire class rather than an individual function. Notice also that the return value data type must be declared as the data type of the values this aggregate function will process. The @input parameter acts as a placeholder, and its name is inconsequential. Note that aggregates can be built on SQL CLR types (covered in the next section) as well as SQL scalar types.

Note

The preceding CREATE AGGREGATE command for the Class Library project version of the sample code is contained in the CreateObjects.sql script in the Management Studio project supplied with the sample code.

To see the aggregate work, first run the CreateTblAggregateTest.sql script file in the Management Studio sample project to create a table called AggregateTest with columns OrderItemId, OrderId, and ItemsOrdered and several rows of data, as shown here:

CREATE TABLE tblAggregateTest(
    [OrderItemId] [int] IDENTITY(1,1) NOT NULL,
    [OrderId] [int] NULL,
    [ItemsOrdered] [int] NOT NULL
)
GO

INSERT INTO tblAggregateTest VALUES (1,2)
INSERT INTO tblAggregateTest VALUES (1,4)
INSERT INTO tblAggregateTest VALUES (2,1)
INSERT INTO tblAggregateTest VALUES (2,12)
INSERT INTO tblAggregateTest VALUES (3,3)
INSERT INTO tblAggregateTest VALUES (3,2)

With such a table built, use the following T-SQL DDL command in your Visual Studio test script or a Management Studio query window to test the aggregate function:

SELECT
  OrderId,
  SUM(ItemsOrdered) AS SUM,
  dbo.BakersDozen(ItemsOrdered) AS BakersDozen
FROM tblAggregateTest
GROUP BY OrderId

For each distinct value in the OrderId column, this query effectively uses our CLR code under the following algorithm:

Call Init().
Call Accumulate once for each row with the same OrderId value, passing it that row’s value of the ItemsOrdered column.
Call Terminate upon a change in the OrderId value to retrieve the aggregated value that the query will pipe to the client.

The results should be as follows:

OrderId     SUM         BakersDozen
						----------- ----------- -----------
1           6           6
2           13          14
3           5           5

By including the built-in T-SQL aggregate SUM in our query, we can see how many bonus items were added. In this case, for OrderId 2, a single bonus item was added, due to one row in the table with the following values:

OrderItemId OrderId     ItemsOrdered
----------- ----------- ------------
4           2           12

All the other rows contain ItemsOrdered values of less than 12, so no bonus items were added for them.

Because SQL Server sometimes segments the work required to satisfy a query over multiple threads, the query processor might need to execute your aggregate function multiple times for a single query and then merge the results together. For your aggregate to work properly in this scenario, you must implement a Merge method.

The Merge method takes the result of one thread’s aggregation and merges it into the current thread’s aggregation. The calculation required to do this could be complicated for some aggregates; in our case, we simply added the DonutCount value from the secondary thread’s aggregate (accessible via the Group input parameter) to our own. There is no need to add bonus items because they would have been added in the individual Accumulate calls on the secondary thread. Simple addition is all that’s required. An aggregate that calculated some type of average, or tracked the largest value in the data series supplied, for example, would require more complex merge code.

Don’t forget that aggregates can be passed scalar values and can be used from T-SQL without referencing a table. Your aggregate must accommodate this scenario, even if it seems impractical. In the case of BakersDozen, single scalar values are easily handled. To see for yourself, try executing the following table-less T-SQL query:

SELECT dbo.BakersDozen(13)

You will see that it returns the value 14.

Note

The TestAggregate.sql script file in the Management Studio project contains both aggregate-testing queries.

Aggregates are an excellent use of SQL CLR programming. Because they are passed data values to be processed, they typically perform only computational tasks and no data access of their own. They consist of compiled CLR code, so they perform well, and unlike stored procedures, triggers, and functions, they cannot be implemented at all in T-SQL. That said, you must still make your aggregate code, especially in the Accumulate method, as “lean and mean” as possible. Injecting your own code into the query processor’s stream of work is an honor, a privilege, and a significant responsibility. Take that responsibility seriously, and make sure that your code is as low-impact as possible.

Other

SQL Server 2008: Monitoring Resource Governor

SQL Server 2008: Managing Resources with the Resource Governor (part 3) - Classifier Function

SQL Server 2008: Managing Resources with the Resource Governor (part 2) - Workload Groups

SQL Server 2008: Managing Resources with the Resource Governor (part 1) - Resource Pools

SharePoint 2010 : SQL Backup Tools

Windows Azure : Storing static reference data with dynamic data

SQL Server 2008 : Managing Query Performance - Adding Hints Through Plan Guides

SQL Server 2008 : Managing Query Performance - Forcing a Specific Execution Plan

Programming Microsoft SQL Server 2005 : An Overview of SQL CLR - CLR Triggers

Programming Microsoft SQL Server 2005 : An Overview of SQL CLR - CLR Functions