Like the index at the end of
this book, indexes within a database enable fast access to table
contents. With each table in a SQL Server 2008 database supporting up to
1000 indexes, fast access can be enabled for a wide variety of lookups.
However, as you'll soon see, poor index selection and maintenance can
have the opposite effect, with reduced performance a common outcome.
It's possible (and common) for tables to be created without any indexes. Such tables are known as heaps. Before continuing, let's take a brief look at heaps.
1. Heaps
Consider the script in listing 1, which creates a simple table and inserts five rows.
Example 1. Creating a heap table
--a heap table and seed with data
CREATE TABLE dbo.client (
clientCode int
, surname nvarchar(100)
, firstName nvarchar(100)
, SSN char(12)
, DOB datetime
)
GO
INSERT INTO dbo.client (
clientCode
, surname
, firstName
, SSN
, DOB
)
VALUES (1, 'Smith', 'John', '111-622-3033', '18 Jun 1974')
, (2, 'Jones', 'Harry', '121-221-3933', '01 Mar 1964')
, (3, 'Brown', 'Bill', '113-262-3223', '19 Apr 1949')
, (4, 'Dean', 'Sally', '191-422-3775', '26 Dec 1979')
, (5, 'White', 'Linda', '118-252-2243', '01 Jan 1998')
GO
|
Heap tables, such as the one created in listing 1,
store their data in no particular physical order.
2. Clustered indexes
A clustered index is added to a table using the create clustered index command, as per the following example:
CREATE CLUSTERED INDEX cixClientSSN ON dbo.client(SSN)
GO
After you create the clustered index on the
Social Security number (SSN) column, the data within the client table is
physically ordered by SSN. The table can now be considered a clustered table.
In addition to reordering the existing rows in the table, new rows will
be inserted in order, based on the SSN value being inserted. Figure 1
illustrates the difference between inserting a record into the client
table as a heap compared to a version with a clustered index on SSN.
Unlike nonclustered indexes, which we'll cover
shortly, a table can have only one clustered index; that is, you can
physically order the rows in a table in only one way. In order to enable
fast access to the data within a table based on the value of other
column(s), we can create additional (up to 999) nonclustered indexes.
Despite the name, a clustered index is more
than just an index. It also contains the table data itself, which is
stored at the leaf level of the index. By default, a primary key
constraint will be created as a clustered index, thereby physically
ordering the table rows based on the value of the primary key column(s).
|
Take our client table above (and assume it was
populated with millions of rows). If we wanted to access a client, and
we had their SSN, we would issue a query similar to this:
SELECT *
FROM dbo.client
WHERE SSN = '111-622-3033'
Given the client table is physically ordered by
SSN, SQL Server can very quickly access the required record(s). Let's
assume now that we wanted to find a client based on their date of birth
and surname, as per the following query:
SELECT *
FROM dbo.client
WHERE
DOB = '18 Jun 1974'
AND surname = 'Smith'
In this case, since the table is ordered by SSN
and there are no other indexes in place, SQL Server is forced to scan
through the table a row at a time, looking for matching records. On
small tables, the performance overhead involved in such a scan wouldn't
be noticed. However, as the table grows, so too does the performance
hit. To address this, we can create additional nonclustered indexes.
3. Nonclustered indexes
A nonclustered index is created using the create nonclustered index command, as per the following example;
CREATE NONCLUSTERED INDEX ixClientDOBSurname ON dbo.client(DOB, surname)
GO
What we've done here is create an index on the combination
of the DOB and Surname columns. In doing so, a separate physical index
structure containing, and ordered by, DOB and Surname is created and
maintained in line with the table. Each time a row is inserted, updated,
or deleted from the client table, the corresponding updates are made to
the nonclustered index. When running a query such as the one above that
selects data based on the DOB/Surname combination, the index is used,
or looked up, with the leaf level of the matching index entries pointing
to the appropriate records in the table.
A good way of understanding the difference
between clustered and nonclustered indexes is thinking about a
paper-based phone book, ordered by surname. If you're looking for the
address of someone whose surname is White, you'd immediately flip to the
back of the book and adjust your search from there. In this sense, the
phone book can be considered to be clustered on Surname.
On the other hand, if all you had was a phone
number and you needed the matching address, your only choice would be to
scan every page of the book, an immensely time-consuming process! If
there was a section added to the rear of the book containing ordered
phone numbers with a corresponding name, you could then flip to the
appropriate page to retrieve the address. This extra section of ordered
phone numbers can be considered a nonclustered index.
Taking this example further, let's imagine email
addresses were added as additional contact information. If all we had
was someone's email address, and we wanted their street address (or
phone number), we'd be back to scanning through the book from start to
finish. As we did with phone numbers, we could build another section at
the back of the book containing ordered email addresses with a
corresponding name to enable fast lookup to the section of the book
containing the contact details. Figure 2 illustrates this concept further.
As per database tables, each time we add indexes
to the rear of our phone book, two things happen. First, the size of
the book increases, and second, we have additional information to
maintain. If anyone changes their name, phone number, or email address,
as well as updating the appropriate page within the book, we'd have to
update one (or both) of the additional indexes. The additional size and
maintenance overhead are important considerations when creating
nonclustered indexes.
Before we complete our brief overview of
clustered and nonclustered indexes, let's examine their internal
structure a little further.
In addition to the index types covered in this book, SQL Server includes support for full-text indexes. Unlike the basic partial-text matching offered by the LIKE
command, full-text indexes enable optimized and comprehensive searching
against large volumes of unstructured data, including linguistic
searching based on the installed language set.
|
4. Index structure
One of the important aspects of nonclustered
indexes we've yet to cover is how they link to the records in the base
table. The section of the nonclustered index that points to the record
in the base table is known as a row locator. If the base table is a heap (no clustered index) the row locator is a pointer to the appropriate table row, identified with a row ID.
The row ID consists of the file ID, the page number, and the row number
on the page. If a table contains a clustered index, a nonclustered
index's row locator is the row's clustered index key.
Indexes within SQL Server (both clustered and nonclustered) are implemented as B-trees. An index seek
starts at the top, or root node, of the tree and traverses through
intermediate levels before arriving at the leaf nodes. In a clustered
index, the leaf nodes contain the actual table data. In a nonclustered
index, the leaf nodes contain the row locators. Figure 3 illustrates this structure further.
With SQL Server including the clustered index
key value in the row locator of nonclustered indexes, two important
design points emerge. First, the width,
or size, of the clustered index directly affects the size of each
nonclustered index. For example, a single integer-based column is much
narrower or smaller than four char(100)-based columns. It follows that
from both storage efficiency and lookup performance perspectives,
clustered indexes should be as small and as narrow as possible.
The other design point to emerge is that the
clustered index should be created before the nonclustered indexes. Think
about what would happen if a table was created as a heap with
nonclustered indexes. As we covered previously, each nonclustered index
would use the heap's row ID as its row locator. If a clustered index was
then created on the table, each of the nonclustered indexes would need
to be updated to include the clustered index key in place of the row ID.
Finally, an important property of a clustered index is whether it's created with a unique
constraint. A nonclustered index's row locator must point to a single
row, so if the clustered index is not unique, which row will the row
locator point to? In addressing this issue, SQL Server will make
nonunique clustered indexes unique by adding a hidden uniqueifier value, which distinguishes rows with the same clustered index value.
The use of the row locator to link a
nonclustered index row to its corresponding heap or clustered index row
is typically referred to as a bookmark lookup; however, in SQL Server 2008, a bookmark lookup operation no longer exists and has been replaced with clustered index seeks, RID lookups, and key lookups.
Which of these is used is determined by the presence (or absence) of a
clustered index and whether SQL Server chooses to use an index or scan
the table. Let's explore this a little further with a look at the key
lookup process.
5. Key lookup
When SQL Server uses a nonclustered index to
fulfill a query request, it uses a key lookup to retrieve the data from
the clustered index (and a RID lookup in the case of a heap). When a
nonclustered index returns one (or few) rows, this is an efficient and
fast operation. However, as the number of matching rows increases, the
combined cost of the index seek plus key lookup operations increases to
the point where it may be quicker to simply scan the base table.
Consider the following two queries:
-- Query A
SELECT * FROM client WHERE SSN = '191-422-3775'
-- Query B
SELECT * FROM client WHERE surname = 'Smith'
In Query A, an index lookup on SSN will return a
single row, assuming each client has a unique SSN. In Query B, we're
looking for people with a surname of Smith. Assuming the client table
has millions of rows, this query could return thousands of matching
rows.
Let's assume that the client table has two
nonclustered indexes to support each of the above queries—one on SSN and
the other on surname—with a clustered index on clientCode. In the first
query, the SSN index seek would return one row, with a single key
lookup required on the clustered index to return the remaining columns.
Contrast this with the second query, which may return thousands of
instances of Smith, each of which requires a key lookup. Because the
base table is clustered on clientCode, each of the resultant Smith key
lookups will be fulfilled from different physical parts of the table,
requiring thousands of random I/O operations.
Depending on the number of rows to be returned,
it may be much faster to ignore the index and sequentially scan the
table's clustered index. Despite reading more rows, the overall cost of a
single, large, sequential I/O operation may be less than thousands of
individual random I/Os.
A couple of interesting points emerge from this.
The first is the possibility of including additional columns in the
nonclustered index to avoid the key lookup. For example, rather than select *,
we specify a limited number of query columns, each of which is added to
the nonclustered index definition. By doing so, the query request is
satisfied from the contents of the nonclustered index alone, avoiding
the need for key lookups to the clustered index.
Second, how does SQL Server know how many key
lookups will be involved in fulfilling a query request? How does it
decide that it's more efficient to do a table/clustered index scan?
Enter statistics.
6. Statistics
When indexes are first created, SQL Server
calculates and stores statistical information on the column values in
the index. When evaluating how a query will be executed, that is,
whether to use an index or not, SQL Server uses the statistics to
estimate the likely cost of using the index compared to other
alternatives.
The selectivity
of an index seek is used to define the estimated percentage of matched
rows compared to the total number of rows in the table. Broadly
speaking, the overall selectivity of an index can be defined as follows:
index selectivity = number of distinct index keys / table row count
The selectivity of an index will range from 0 to
1, or the equivalent value expressed as a percentage. Looking at the
extreme ends of the selectivity scale, a unique index represents the
best possible selectivity of 1, or 100 percent; every row has a unique
(distinct) value. Primary key indexes are perfectly selective. In
contrast, an index on a column where every row has the same value
(approaching 0 percent selectivity) is obviously of limited/zero value.
By keeping statistics on the distribution of
column values in the index, the selectivity of the query can be
estimated. As an example, if there are 1020 Smiths and 2 Zatorskys in a
surname index, a search on Zatorsky is far more selective than a search
on Smith and therefore is more likely to use an index lookup.
An important aspect of statistics is whether
they remain up to date. Consider an example where an index starts out
being highly selective. A query that performs an equals
search on such an indexed column would obviously perform very well. But
if a large volume of rows was added to the table with the same indexed
column value, the selectivity would lower, potentially dramatically,
depending on the volume of rows added and the contents of the indexed
column. At this point, the same query, using the same index, may then
perform very poorly. To counter situations such as this, SQL Server has a
number of settings for automatically updating statistics.
With this background in mind, let's move on to look at the index design process.