A good database design is
made in conjunction with, and is conscious of, application data access
logic. For example, in order to design indexes for a particular table,
the database designer must know how users will be accessing the table
from the application(s). If an application allows searching for data on a
particular column or set of columns, then this needs to be considered
from an indexing point of view. That's not to suggest that the
application completely dictates index design. The reverse is often true;
sometimes unrealistic application access must be modified in order to
prevent user-generated activity that causes database performance
problems.
In this section, we'll concentrate on generic
index design strategies, beginning with the type of columns suitable for
a clustered index. We'll then look at an area we touched on in our
introduction, covering indexes and included columns, before concluding
the section with coverage of a new feature in SQL Server 2008, filtered indexes, and how they compare with indexed views.
Let's begin with an important step in table design: selecting a clustered index.
1. Selecting a clustered index
When a table is created with a primary key
constraint, as per the following example, a unique clustered index is
automatically created on the column(s) in the primary key, unless
specified otherwise.
-- Creates a clustered index by default on the clientCode primary key
CREATE TABLE dbo.client (
clientCode int PRIMARY KEY
, surname nvarchar(100)
, firstName nvarchar(100)
, SSN char(11)
, DOB datetime
)
GO
In this example, the clientCode column will be
used as the primary key of the table as well as the unique clustered
index. Defining the column as the primary key means an explicit CREATE CLUSTERED INDEX
command is not required. Should we wish to create the clustered index
on a different column, SSN for example, we could create the table as
follows:
-- Create a clustered index on a nonprimary key column
CREATE TABLE dbo.client (
clientCode int PRIMARY KEY NONCLUSTERED
, surname nvarchar(100)
, firstName nvarchar(100)
, SSN char(11)
, DOB datetime
)
GO
CREATE UNIQUE CLUSTERED INDEX cixClientSSN ON dbo.client(SSN)
GO
Created in this manner, the client table will
contain two indexes: a unique nonclustered index for the primary key
constraint and a unique clustered index for the SSN column.
So, generally speaking, which types of columns
make the best candidates for a clustered index?
The clustered index key is contained in
the leaf node of each nonclustered index as the row locator. If the
clustered index changes from one column to another, each nonclustered
index needs to be updated in order to maintain the linkage from the
nonclustered index to the base table. Further, if the column value of
the clustered index changes, similar updates are required in each of the
nonclustered indexes.
The width of the
clustered index directly affects the size of each nonclustered index.
Again, this is a consequence of including the clustered index key in the
leaf node of each nonclustered index.
If
a clustered index is not unique, SQL Server will make it so by adding a
hidden uniqueifier column to the table for inclusion in the index.
It follows that the best candidates for a clustered index are columns that
Change infrequently (ideally not at all)—A stable column value avoids the need to maintain nonclustered index row locators.
Are narrow—They limit the size of each nonclustered index.
Are unique—They avoid the need for a uniqueifier.
With these attributes in mind, a common pattern for table design is to create what's called a surrogate key, using the IDENTITY property as per this example:
-- Use the IDENTITY property to create a clustered primary key column
CREATE TABLE dbo.client (
clientKey int IDENTITY (1,1) PRIMARY KEY
, surname nvarchar(100)
, firstName nvarchar(100)
, SSN char(11)
, DOB datetime
)
GO
By adding the IDENTITY (1,1) property to the clientKey
column definition, SQL Server will populate this column's value with an
automatically incrementing number for each new row, starting at 1 for
the first row and increasing upward by 1 for each new row.
Using the IDENTITY property to create a
surrogate key in this manner meets the desired attributes for a
clustered index. It's an arbitrary number used purely to identify the
record, and therefore it has no reason to be modified. It's narrow: a
single integer-based column will occupy only 4 bytes. Finally, it's
unique; SQL Server will automatically take care of the uniqueness,
courtesy of the IDENTITY property.
In our client table example, the other candidate
for a clustered index, as well as the primary key, is the Social
Security number. It's reasonably narrow (11 bytes), unlikely to change,
and unique. In fact, if we made SSN the unique clustered primary key,
we'd have no need for the identity-based clientKey column. But there's
one big problem here. It's unique for those who have
an SSN. What about those who don't have one or those who can't recall
it? If the SSN was the primary key value, the lack of an SSN would
prevent a row from being inserted into the table.
For this reason, the best primary keys/unique clustered indexes tend to
be artificial or surrogate keys that lack meaning and use
system-generated uniqueness features such as the identity column. Of
course, there are exceptions to this rule, and this is a commonly argued
point among database design professionals.
The other consideration for a clustered index is
column(s) involved in frequent range scans and queries that require
sorted data.
A common database design practice is to use
globally unique identifier (GUID) columns as primary keys, which by
default will also be the table's clustered index unless specified
otherwise. Not only are GUIDs wide (16 bytes), they're randomly
generated. Given such tables are clustered on the GUID column, newly
inserted rows will be randomly positioned throughout the table, leading
to page splits and subsequent fragmentation. This is a particular
concern for tables with a high volume of data inserts. SQL Server 2005
introduced the NewSequentialID() function, which partially
offsets this problem by generating GUIDs in a sequential manner.
Removing the "randomness" from the GUID values helps in reducing both
page splits and fragmentation.
|
Range scans and sort operations
The case where
nonclustered indexes are sometimes ignored if the estimated number of
rows to be returned exceeds a certain percentage. The reason for this is
the accumulated cost of the individual key/RID lookup and random I/O
operations for each row.
For tables that are frequently used in
range-scanning operations, clustering on the column(s) used in the range
scan can provide a big performance boost. As an example, consider a
sales table with an orderDate column and frequent queries such as this
one:
-- Range Scan - Potential for a clustered index on orderDate?
SELECT *
FROM dbo.sales
WHERE orderDate BETWEEN '1 Jan 2008' AND '1 Feb 2008'
Depending on the statistics, a nonclustered
index seek on orderDate will more than likely be ignored because of the
number of key lookups involved. However, a clustered index on orderDate
would be ideal; using the clustered index, SQL Server would quickly
locate the first order and then use sequential I/O to return all
remaining orders for the date range.
Finally, queries that select large volumes of sorted (ORDER BY) data often benefit from clustered indexes on the column used in the ORDER BY clause. With the data already sorted in the clustered index, the sort operation is avoided, boosting performance.
Often, a number of attributes come together to
make a column an ideal clustered index candidate. Take, for example, the
previous query, which selects orders based on a date range; if that
query also required orders to be sorted, then we could avoid both key
lookups and sort operations by clustering on orderDate.
The process for choosing the best
clustered index is obviously dependent on the specifics of each database
table and knowledge of how applications use the table. That being said,
the above recommendations hold true in most cases. In a similar manner,
there are a number of common techniques used in designing nonclustered
indexes.