SQL Server 2012 : Exploring SQL CLR - CLR Aggregates

11/25/2013 8:30:56 PM

T-SQL has a number of built-in aggregates, such as SUM, AVG, and MAX, but that set of built-in functions is not always sufficient. Luckily, the SQL CLR features in SQL Server 2012 allow us to implement user-defined aggregates in .NET code and use them from T-SQL. User-defined aggregates can be implemented only in SQL CLR code; they have no T-SQL equivalent. Because aggregates tend to perform computation only, they provide an excellent use case for SQL CLR code. As it turns out, they are also quite easy to build.

At first, aggregates look and feel like functions because they accept and return values. In fact, if you use an aggregate in a non-data-querying T-SQL call (for example, SELECT SUM(8)), you are actually treating the aggregate as if it were a function. The thing to remember is that the argument passed to an aggregate is typically a column, and so each discrete value for that column—for whichever WHERE, HAVING, ORDER BY, and/or GROUP BY scope applies—gets passed in to the aggregate. It is the aggregate’s job to update a variable, which eventually will be the return value, as each discrete value is passed to it.

CLR aggregates require you to apply the SqlUserDefinedAggregate attribute to them. The SqlUserDefinedAggregate attribute accepts a number of parameters, but all of them are optional except Format. In our example, we will use the value Format.Native for the Format parameter.

Note

User-defined aggregates greater than 8,000 bytes in size require use of the Format.UserDefined value for the Format parameter, rather than the Format.Native value used in this example.

Unlike the SqlProcedure, SqlFunction, and SqlTrigger attributes, the SqlUserDefinedAggregate attribute is required by SQL Server for your class to be eligible for use as an aggregate. Visual Studio SQL Server Database Projects do use this attribute for deployment, and the attribute is included in the aggregate template, but it also must be used in generic Class Library project code in order for T-SQL registration of the aggregate to succeed.

Aggregate classes must have four methods: Init, Accumulate, Merge, and Terminate. The Init method is used to clean up as necessary from previous uses of this aggregate instance, allowing it to re-start a new aggregate computation. The Accumulate method accepts a SQL type for processing. The Terminate method returns a SQL type representing the result. And finally, the Merge method accepts an object typed as the aggregate class itself so that it can be combined with the executing instance.

The Accumulate method handles the processing of a discrete value into the aggregate value, and the Terminate method returns the final aggregated value after all discrete values have been processed. The Init method provides startup code, typically initializing a class-level private variable that will be used by the AccumulateMerge method is called in a specific multithreading scenario, which we will describe later in this section. method. The

Just to be perfectly clear, your aggregate class will not implement an interface to supply these methods; you must create them to meet what we might term the “conventions” that are expected of SQL CLR aggregate classes (as opposed to a “contract” with which they must comply). When you develop your code in a Visual Studio SQL Server Database Project, the Aggregate template includes stubs for these four methods as well as the proper application of the SqlUserDefinedAggregate attribute.

Creating your own aggregates is fairly straightforward, but thinking through aggregation logic can be a bit confusing at first. Imagine that you want to create a special aggregate named BakersDozen that increments its accumulated value by 1 for every discrete value’s multiple of 12 (much as a baker, in simpler times, would throw in a free 13th donut when you ordered 12). By using what you now know about CLR aggregates and combining that with integer division, you can implement a BakersDozen aggregate quite easily. Example 1, the code from struct BakersDozen in aggTest.cs in the sample project, contains the entire implementation of the aggregate BakersDozen.

Example 1. BakersDozen struct from aggTest.cs.

[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(Format.Native)]
public struct BakersDozen
{
    private SqlInt32 DonutCount;

    public void Init()
    {
        DonutCount = 0;
    }

    public void Accumulate(SqlInt32 Value)
    {
        DonutCount += Value + ((Int32)Value) / 12;
    }

    public void Merge(BakersDozen Group)
    {
        DonutCount += Group.DonutCount;
    }

    public SqlInt32 Terminate()
    {
        return DonutCount;
    }
}

The code here is fairly straightforward. The private variable DonutCount is used to track the BakersDozen-adjusted sum of items ordered, adding the actual items ordered value and incrementing the running total by the integer quotient of the ordered value divided by 12. By this logic, bonus items are added only when an individual value equals or exceeds a multiple of 12. Twelve includes a full dozen, and so would 13. Twenty-four includes two dozen, and so would 27. Two individual orders of 6 items each would not generate any bonus items because a minimum of 12 items must be contained in a discrete value to qualify for a bonus.

To deploy the aggregate, use attribute-based deployment in the SQL Server Database Project or the following command for the Class Library version:

CREATE AGGREGATE BakersDozen (@input int)
 RETURNS int
 EXTERNAL NAME SQLCLRDemo.BakersDozen

Notice that no method name is specified because the aggregate is implemented by an entire class rather than an individual function. Notice also that the return value data type must be declared as the data type of the values this aggregate function will process. The @input parameter acts as a placeholder, and its name is inconsequential. Note that aggregates can be built on SQL CLR types (covered in the next section) as well as SQL scalar types.

To see the aggregate work, first run the CreateTblAggregateTest.sql script file in the SSMS sample project to create a table named AggregateTest with columns OrderItemId, OrderId, and ItemsOrdered and several rows of data, as shown here:

CREATE TABLE tblAggregateTest(
    [OrderItemId] [int] IDENTITY(1,1) NOT NULL,
    [OrderId] [int] NULL,
    [ItemsOrdered] [int] NOT NULL
)
GO

INSERT INTO tblAggregateTest VALUES (1,2)
INSERT INTO tblAggregateTest VALUES (1,4)
INSERT INTO tblAggregateTest VALUES (2,1)
INSERT INTO tblAggregateTest VALUES (2,12)
INSERT INTO tblAggregateTest VALUES (3,3)
INSERT INTO tblAggregateTest VALUES (3,2)

With such a table built, use the following T-SQL DML statement in a query window to test the aggregate function:

SELECT
  OrderId,
  SUM(ItemsOrdered) AS SUM,
  dbo.BakersDozen(ItemsOrdered) AS BakersDozen
FROM tblAggregateTest
GROUP BY OrderId

For each distinct value in the OrderId column, this query effectively uses the CLR code under the following algorithm:

Call Init().
Call Accumulate once for each row with the same OrderId value, passing it that row’s value of the ItemsOrdered column.
Call Terminate to retrieve the aggregated value that the query will return to the client.

The results should be as follows:

OrderId           SUM              BakersDozen
-----------       -----------      -----------
1                 6                6
2                 13               14
3                 5                5

By including the built-in T-SQL aggregate SUM in the query, you can see how many bonus items were added. In this case, for OrderId 2, a single bonus item was added, due to one row in the table with the following values:

OrderItemId      OrderId          ItemsOrdered
-----------      -----------      ------------
4                2                12

All the other rows contain ItemsOrdered values of less than 12, so no bonus items were added for them.

Because SQL Server sometimes segments the work required to satisfy a query over multiple threads, the query processor might need to execute your aggregate function multiple times for a single query and then merge the results together. For your aggregate to work properly in this scenario, you must implement a Merge method.

The Merge method takes the result of one thread’s aggregation and merges it into the current thread’s aggregation. The calculation required to do this could be complicated for some aggregates; in our case, you simply added the DonutCount value from the secondary thread’s aggregate (accessible via the Group input parameter) to your own. There is no need to add bonus items because they would have been added in the individual Accumulate calls on the other thread. Simple addition is all that’s required. An aggregate that calculated some type of average, for example, would require more complex Merge code.

Don’t forget that aggregates can be passed scalar values and can be used from T-SQL without referencing a row set. Your aggregate must accommodate this scenario, even if it seems impractical. In the case of BakersDozen, single scalar values are easily handled. To see for yourself, try executing the following table-less T-SQL query:

SELECT dbo.BakersDozen(13)

You will see that it returns the value 14.

Note

The TestAggregate.sql script file in the SSMS project contains both aggregate-testing queries.

Aggregates are an excellent use of SQL CLR programming. Because they are passed data values to be processed, they typically perform only computational tasks and no data access of their own. They consist of compiled CLR code, so they perform well, and unlike stored procedures, triggers, and functions, they cannot be implemented at all in T-SQL. That said, you must still make your aggregate code, especially in the Accumulate method, as “lean and mean” as possible. Injecting your own code into the query processor’s stream of work is an honor, a privilege, and a significant responsibility. Take that responsibility seriously, and make sure that your code is as low-impact as possible.

Note

More Info SQL Server 2012 provides (as did SQL Server 2008R2 and 2008 before it) support for aggregates that accept multiple input parameters. Implementing such aggregate functions involves accepting two or more parameters in the Accumulate method, processing them in your code, and then declaring them in the CREATE AGGREGATE T-SQL statement, if you are not using SSDT auto-deployment. Aggregates that accept multiple input parameters still must, of course, return a single value.

Other

SQL Server 2012 : Exploring SQL CLR - CLR Triggers

My SQL : Replication for High Availability - Procedures (part 8) - Circular Replication

My SQL : Replication for High Availability - Procedures (part 7) - Slave Promotion - Slave promotion in Python

My SQL : Replication for High Availability - Procedures (part 6) - Slave Promotion - A revised method for promoting a slave

My SQL : Replication for High Availability - Procedures (part 5) - Semisynchronous Replication - Configuring semisynchronous replication

My SQL : Replication for High Availability - Procedures (part 4) - Dual Masters - Replicated disks using DRBD , Bidirectional replication

My SQL : Replication for High Availability - Procedures (part 3) - Dual Masters - Shared disks

My SQL : Replication for High Availability - Procedures (part 2) - Hot Standby

My SQL : Replication for High Availability - Procedures (part 1) - The mysql.com Outage

My SQL : Replication for High Availability - Redundancy, Planning