T-SQL has a number of built-in aggregates, such as SUM(), AVG(),
and MAX(), but
that set of built-in functions is not always sufficient. Luckily, the
SQL CLR features in SQL Server 2005 allow us to implement user-defined
aggregates in .NET code and use them from T-SQL. User-defined aggregates
can be implemented only in SQL CLR code; they have no T-SQL equivalent.
Because aggregates tend to perform computation only, they provide an
excellent use case for SQL CLR code. As it turns out, they are also
quite easy to build.
At first, aggregates feel
and look like functions because they accept and return values. In fact,
if you use an aggregate in a non-data-querying T-SQL call (for example, SELECT SUM(8)),
you are in fact treating the aggregate as if it were a function. The
thing to remember is that the argument passed to an aggregate is
typically a column, and so each discrete value for that column, for
whichever WHERE, HAVING, ORDER BY,
and/or GROUP BY
scope applies, gets passed into the aggregate. It is the aggregate’s
job to update a variable, which eventually will be the return value, as
each discrete value is passed to it.
CLR aggregates require you to apply the SqlUserDefinedAggregate attribute to them.
The SqlUserDefinedAggregate attribute accepts a number of parameters, but all
of them are optional except Format. In our example, we will use the value Format.Native for the Format parameter. For
more advanced scenarios, you might want to study SQL Server Books
Online to acquaint yourself with the other parameters this attribute
accepts. Sticking with Format.Native
for the Format
parameter is sufficient for many scenarios.
Unlike the SqlProcedure,
SqlFunction, and SqlTrigger attributes, the SqlUserDefinedAggregate
attribute is required by SQL Server for your class to be eligible for
use as an aggregate. Visual Studio SQL Server projects do use this
attribute for deployment, and the attribute is included in the aggregate
template, but it also must be used in generic Class Library project
code in order for T-SQL registration of the aggregate to succeed.
Aggregate classes
must have four methods: Init, Accumulate, Merge,
and Terminate. The Accumulate method
accepts a SQL type, the Terminate method returns one, and the Merge method accepts an object typed as the aggregate
class itself.
The Accumulate method handles the processing of a discrete value into
the aggregate value, and the Terminate method returns the final aggregated value
after all discrete values have been processed. The Init method
provides startup code, typically initializing a class-level private
variable that will be used by the AccumulateMerge
method is called in a specific multi-threading scenario, which we will
describe later on.
method. The
Just to be perfectly
clear, your aggregate class will not implement an interface to supply
these methods; you must create them to meet what we might term the
“conventions” that are expected of SQL CLR aggregate classes (as opposed
to a “contract” with which they must comply). When you develop your
code in a Visual Studio 2005 SQL Server project, the Aggregate template
includes stubs for these four methods as well as the proper application
of the SqlUserDefinedAggregate
attribute.
Creating your own aggregates is fairly straightforward, but thinking
through aggregation logic can be a bit confusing at first. Imagine you
want to create a special aggregate called BakersDozen that increments its accumulated value by 1
for every 12 units accumulated (much as a baker, in simpler times, would
throw in a free 13th donut when you ordered 12). By using what you now
know about CLR aggregates and combining that with integer division, you
can implement a BakersDozen aggregate
quite easily. Listing 1, the code from struct BakersDozen in
aggTest.cs in the sample project, contains the entire implementation of
the aggregate BakersDozen.
Listing 1. struct BakersDozen
from aggTest.cs
[Serializable] [Microsoft.SqlServer.Server.SqlUserDefinedAggregate(Format.Native)] public struct BakersDozen { private SqlInt32 DonutCount;
public void Init() { DonutCount = 0; }
public void Accumulate(SqlInt32 Value) { DonutCount += Value + ((Int32)Value) / 12; }
public void Merge(BakersDozen Group) { DonutCount += Group.DonutCount; }
public SqlInt32 Terminate() { return DonutCount; }
}
|
The code
here is fairly straightforward. The private variable DonutCount is used to track
the BakersDozen-adjusted sum of items ordered, adding the actual
items- ordered value and incrementing the running total by the integer
quotient of the ordered value divided by 12. By this logic, bonus items are added only when an
individual value equals or exceeds a multiple of 12. Twelve includes a
full dozen, and so would 13. Twenty-four includes two dozen, and so
would 27. Two individual orders of 6 items each would not generate any
bonus items because a minimum of 12 items must be ordered in a line item
to qualify for a bonus.
To deploy the
aggregate, use attribute-based deployment in the SQL Server project or
the following command for the Class Library version:
CREATE AGGREGATE BakersDozen
(@input int)
RETURNS int
EXTERNAL NAME Chapter03.BakersDozen
Notice that no method name
is specified because the aggregate is implemented by an entire class
rather than an individual function. Notice also that the return value
data type must be declared as the data type of the values this aggregate
function will process. The @input parameter acts as a placeholder, and its
name is inconsequential. Note that aggregates can be built on SQL CLR
types (covered in the next section) as well as SQL scalar types.
Note
The
preceding CREATE AGGREGATE command for the Class Library project version
of the sample code is contained in the CreateObjects.sql script in the
Management Studio project supplied with the sample code. |
To see the aggregate
work, first run the CreateTblAggregateTest.sql script file in the
Management Studio sample project to create a table called AggregateTest
with columns OrderItemId, OrderId, and ItemsOrdered and several rows of
data, as shown here:
CREATE TABLE tblAggregateTest(
[OrderItemId] [int] IDENTITY(1,1) NOT NULL,
[OrderId] [int] NULL,
[ItemsOrdered] [int] NOT NULL
)
GO
INSERT INTO tblAggregateTest VALUES (1,2)
INSERT INTO tblAggregateTest VALUES (1,4)
INSERT INTO tblAggregateTest VALUES (2,1)
INSERT INTO tblAggregateTest VALUES (2,12)
INSERT INTO tblAggregateTest VALUES (3,3)
INSERT INTO tblAggregateTest VALUES (3,2)
With such a table built, use the following T-SQL DDL
command in your Visual Studio test script or a Management Studio query
window to test the aggregate function:
SELECT
OrderId,
SUM(ItemsOrdered) AS SUM,
dbo.BakersDozen(ItemsOrdered) AS BakersDozen
FROM tblAggregateTest
GROUP BY OrderId
For each distinct value in
the OrderId column, this query effectively uses our CLR code under the
following algorithm:
Call Init().
Call Accumulate once
for each row with the same OrderId
value, passing it that row’s value of the ItemsOrdered
column.
Call Terminate
upon a change in the OrderId value to retrieve the aggregated value that the
query will pipe to the client.
The results should be as follows:
OrderId SUM BakersDozen
----------- ----------- -----------
1 6 6
2 13 14
3 5 5
By including the built-in T-SQL aggregate SUM in our query, we
can see how many bonus items were added. In this case, for OrderId 2, a
single bonus item was added, due to one row in the table with the
following values:
OrderItemId OrderId ItemsOrdered
----------- ----------- ------------
4 2 12
All the other rows
contain ItemsOrdered
values of less than 12, so no bonus items were added for them.
Because SQL Server
sometimes segments the work required to satisfy a query over multiple
threads, the query processor might need to execute your aggregate
function multiple times for a single query and then merge the results
together. For your aggregate to work properly in this scenario, you must
implement a Merge method.
The Merge method takes the result of one thread’s
aggregation and merges it into the current thread’s aggregation. The
calculation required to do this could be complicated for some
aggregates; in our case, we simply added the DonutCount value from the secondary thread’s
aggregate (accessible via the Group input parameter) to our own. There is no need to add
bonus items because they would have been added in the individual Accumulate calls
on the secondary thread. Simple addition is all that’s required. An
aggregate that calculated some type of average, or tracked the largest
value in the data series supplied, for example, would require more
complex merge code.
Don’t forget that
aggregates can be passed scalar values and can be used from T-SQL
without referencing a table. Your aggregate must accommodate this
scenario, even if it seems impractical. In the case of BakersDozen, single
scalar values are easily handled. To see for yourself, try executing the
following table-less T-SQL query:
SELECT dbo.BakersDozen(13)
You will see that it
returns the value 14.
Note
The
TestAggregate.sql script file in the Management Studio project contains
both aggregate-testing queries. |
Aggregates are an
excellent use of SQL CLR programming. Because they are passed data
values to be processed, they typically perform only computational tasks
and no data access of their own. They consist of compiled CLR code, so
they perform well, and unlike stored procedures, triggers, and
functions, they cannot be implemented at all in T-SQL. That said, you
must still make your aggregate code, especially in the Accumulate
method, as “lean and mean” as possible. Injecting your own code into the
query processor’s stream of work is an honor, a privilege, and a
significant responsibility. Take that responsibility seriously, and make
sure that your code is as low-impact as possible.