In SQL Server 2008, Microsoft introduced a new
feature called Change Data Capture (CDC), which is designed to make it
much easier and less resource intensive to identify and retrieve changed
data from tables in an online transaction processing (OLTP) database.
In a nutshell, CDC captures and records INSERT, UPDATE, and DELETE
activity in an OLTP database and stores it in a form that is easily
consumed by an application, such as a SQL Server Integration Services
(SSIS) package.
In the past, capturing
data changes for your tables for auditing or extract, transform, and
load (ETL) purposes required using replication, time stamp columns,
triggers, complex queries, or expensive third-party tools. None of these
other methods are easy to implement, and many of them use a lot of
server resources, negatively affecting the performance of the OLTP
server.
Change Data Capture provides for a more efficient mechanism for capturing the data changes in a table.
Note
Change Data Capture is available only in the SQL Server 2008 Developer, Enterprise, and Datacenter Editions.
The source of change data
for Change Data Capture is the SQL Server transaction log. As inserts,
updates, and deletes are applied to tables, entries that describe those
changes are added to the transaction log. When Change Data Capture is
enabled for a database, a SQL Server Agent capture job is created to
invoke the sp_replcmds system
procedure. This procedure is an internal server function and is the same
mechanism used by transactional replication to harvest changes from the
transaction log.
Note
If replication is
already enabled for the database, the transactional log reader used for
replication is also used for CDC. This strategy significantly reduces
log contention when both replication and Change Data Capture are enabled
for the same database.
The principal task of the
Change Data Capture process is to scan the log and identify changes to
data rows in any tables configured for Change Data Capture. As these
changes are identified, the process writes column data and
transaction-related information to the Change Data Capture tables. The
changes can then be read from these change tables to be applied as
needed.
The Change Data Capture Tables
When CDC is enabled for a
database and one or more tables, an associated Change Data Capture table
is created for each table being monitored. The Change Data Capture
tables are used to store the changes made to the data in corresponding
source tables, along with some metadata used to track the changes. By
default, the name of the CDC change table is schemaname_tablename_CT and is based on the name of the source table.
The first five columns of a
Change Data Capture change table are metadata columns and contain
additional information relevant to the recorded change:
__$start_lsn—
Identifies the commit log sequence number (LSN) assigned to the change.
This value can be used to determine the order of the transactions.
__$end_lsn— Is currently not used and in SQL Server 2008 is always NULL.
__$seqval— Can be used to order changes that occur within the same transaction.
__$operation— Records the operation associated with the change: 1 = delete, 2 = insert, 3 = update before image(delete), and 4 = update after image(insert)
__$update_mask—
Is a variable bit mask with one defined bit for each captured column to
identify what columns were changed. For insert and delete entries, the
update mask always has all bits set. Update rows have the bits set only
for the columns that were modified.
The remaining columns in
the Change Data Capture change table are identical to the columns from
the source table in name and type and are used to store the column data
gathered from the source table when an insert, update, or delete
operation is performed on the table.
For every row inserted
into the source table, a single row a single row is inserted into the
change table, and this row contains the column values inserted into the
source table. Every row deleted from the source table is also inserted
as a single row into the change table but contains the column values in
the row before the delete operation. An update operation is captured as a
delete followed by an insert, so two rows are captured for each update:
one row entry to capture the column values before the update, and a
second row entry to capture the column values after the update.
In addition to the Change Data Capture tables, the following Change Data Capture metadata tables are also created:
cdc.change_tables— Contains one row for each change table in the created when Change Data Capture is enabled on a source table.
cdc.index_columns—
Contains one row for each index column used by Change Data Capture to
uniquely identify rows in the source table. By default, this is the
column of the primary key of the source table, but a different unique
index on the source table can be specified when Change Data Capture is
enabled on the source table. A primary key or unique index is required
on the source table only if Net Change Tracking is enabled.
cdc.captured_columns—
Contains one row for each column tracked in each source table. By
default, all columns of the source table are captured, but you can
include or exclude columns when enabling Change Data Capture for a table
by specifying a column list.
cdc.ddl_history—
Contains a row for each Data Definition Language (DDL) change made to
any table enabled for Change Data Capture. You can use this table to
determine when a DDL change occurred on a source table and what the
change was.
cdc.lsn_time_mapping—
Contains a row for each transaction stored in a change table and is
used to map between log sequence number (LSN) commit values and the
actual time the transaction was committed.
Although you can query the
Change Data Capture tables directly, it is not recommended. Instead, you
should use the Change Data Capture functions, which are discussed
later.
All these objects associated with a CDC instance are created in the special schema called cdc when Change Data Capture is enabled for a database.
Enabling CDC for a Database
Before you can begin
capturing data changes for a table, you must first enable the database
for Change Data Capture. You do this by running the stored procedure sys.sp_cdc_enable_db within the desired database context. When a database is enabled for Change Data Capture, the cdc schema, cdc user, metadata tables, as well as the system functions, are used to query for change data.
Note
To determine whether a database is already enabled for CDC, you can check the value in the is_cdc_enabled column in the sys.databases catalog view. A value of 1 indicates that CDC is enabled for the specified database.
The following SQL code enables CDC for the AdventureWorks2008R2 database and then checks that CDC is enabled by querying the sys.databases catalog view:
use AdventureWorks2008R2
go
exec sys.sp_cdc_enable_db
go
select is_cdc_enabled
from sys.databases
where name = 'AdventureWorks2008R2'
go
is_cdc_enabled
--------------
1
Note
Although the examples presented here are run against the AdventureWorks2008R2 database, they can also be run against the AdventureWorks2008 database. However, you should be aware that some of the column values displayed may not be exactly the same.
Enabling CDC for a Table
When the database is enabled for Change Data Capture, you can use the sys.sp_cdc_enable_table stored procedure to enable a Change Data Capture instance for any tables in that database. The sp_cdc_enable_Table stored procedure supports the following parameters:
@source_schema— Specifies the name of the schema in which the source table resides.
@source_name— Specifies the name of the source table.
@role_name— Indicates the name of the database role used to control access to Change Data Capture tables. If this parameter is set to NULL,
no role is used to limit access to the change data. If the specified
role does not exist, SQL Server creates a database role with the
specified name.
@capture_instance—
Specifies the name of the capture instance used to name the
instance-specific Change Data Capture objects. By default, this is the
source schema name plus the source table name in the format schemaname_sourcename. A source table can have a maximum of two capture instances.
@supports_net_changes— Is set to 1 or 0
to indicate whether support for querying for net changes is to be
enabled for this capture instance. If this parameter is set to 1, the source table must have a defined primary key, or an alternate unique index must be specified for the @index_name parameter.
@index_name— Specifies the name of a unique index to use to uniquely identify rows in the source table.
@captured_column_list—
Specifies the source table columns to be included in the change table.
By default, all columns are included in the change table.
@filegroup_name— Specifies the filegroup to be used for the change table created for the capture instance. If this parameter is NULL
or not specified, the default filegroup is used. If possible, it is
recommended you create a separate filegroup from your source tables for
the Change Data Capture change tables.
@allow_partition_switch— Indicates whether the SWITCH PARTITION command of ALTER TABLE can be executed against a table that is enabled for Change Data Capture. The default is 1
(enabled). If any partition switches occur, Change Data Capture does
not track the changes resulting from the switch. This causes data
inconsistencies when the change data is consumed.
The @source_schema, @source_name, and @role_name parameters are the only required parameters. All the others are optional and apply default values if not specified.
To implement basic change data tracking for a table, let’s first create a copy of the Customer table to play around with:
select * into MyCustomer from Sales.Customer
alter table MyCustomer add Primary key (CUstomerID)
Now, to enable CDC on the MyCustomer table, you can execute the following:
EXEC sys.sp_cdc_enable_table
@source_schema = N'dbo',
@source_name = N'MyCustomer',
@role_name = NULL
Note
If
this is the first time you are enabling CDC for a table in the
database, you may see the following messages, which indicate that SQL
Server is enabling the SQL Agent jobs to begin capturing the data
changes in the database:
Job 'cdc.AdventureWorks2008R2_capture' started successfully.
Job 'cdc.AdventureWorks2008R2_cleanup' started successfully.
The Capture job that is
created generally runs continuously and is used to move changed data to
the CDC tables from the transaction log. The Cleanup job runs on a
scheduled basis to remove older data from the CDC tables so that they
don’t grow too large. By default, it automatically removes data that is
more than three days old. The properties of these jobs can be viewed and
modified using the sys.sp_cdc_help_jobs and sys.sp_cdc_change_job procedures, respectively.
To determine whether or not a source table has been enabled for Change Data Capture, you can query the is_tracked_by_cdc column in the sys.tables catalog view for that table:
select is_tracked_by_cdc
from sys.tables
where name = 'MyCustomer'
go
is_tracked_by_cdc
-----------------
1
Tip
To get information on which tables are configured for CDC and what the settings for each are, you can execute the sys.sp_cdc_help_change_data_capture
stored procedure. It reports the name and ID of the source and Change
Tracking tables, the CDC table properties, the columns included in the
capture, and the date the CDC was enabled/created for the source table.