SQL Server 2008 : Index design and maintenance - An introduction to indexes

2/17/2013 6:39:03 PM

Like the index at the end of this book, indexes within a database enable fast access to table contents. With each table in a SQL Server 2008 database supporting up to 1000 indexes, fast access can be enabled for a wide variety of lookups. However, as you'll soon see, poor index selection and maintenance can have the opposite effect, with reduced performance a common outcome.

It's possible (and common) for tables to be created without any indexes. Such tables are known as heaps. Before continuing, let's take a brief look at heaps.

1. Heaps

Consider the script in listing 1, which creates a simple table and inserts five rows.

Example 1. Creating a heap table

--a heap table and seed with data
CREATE TABLE dbo.client (
   clientCode int
   , surname nvarchar(100)
   , firstName nvarchar(100)
   , SSN char(12)
   , DOB datetime
)
GO

INSERT INTO dbo.client (
   clientCode
   , surname
   , firstName
   , SSN
   , DOB
)
VALUES (1, 'Smith', 'John', '111-622-3033', '18 Jun 1974')
, (2, 'Jones', 'Harry', '121-221-3933', '01 Mar 1964')
, (3, 'Brown', 'Bill', '113-262-3223', '19 Apr 1949')
, (4, 'Dean', 'Sally', '191-422-3775', '26 Dec 1979')
, (5, 'White', 'Linda', '118-252-2243', '01 Jan 1998')
GO

Heap tables, such as the one created in listing 1, store their data in no particular physical order.

2. Clustered indexes

A clustered index is added to a table using the create clustered index command, as per the following example:

CREATE CLUSTERED INDEX cixClientSSN ON dbo.client(SSN)
GO

Figure 1. Inserting a row into a table with a clustered index will physically position the row based on the value of the column in the clustered index key, in this case SSN.

After you create the clustered index on the Social Security number (SSN) column, the data within the client table is physically ordered by SSN. The table can now be considered a clustered table. In addition to reordering the existing rows in the table, new rows will be inserted in order, based on the SSN value being inserted. Figure 1 illustrates the difference between inserting a record into the client table as a heap compared to a version with a clustered index on SSN.

Unlike nonclustered indexes, which we'll cover shortly, a table can have only one clustered index; that is, you can physically order the rows in a table in only one way. In order to enable fast access to the data within a table based on the value of other column(s), we can create additional (up to 999) nonclustered indexes.

More than just an index

Despite the name, a clustered index is more than just an index. It also contains the table data itself, which is stored at the leaf level of the index. By default, a primary key constraint will be created as a clustered index, thereby physically ordering the table rows based on the value of the primary key column(s).

Take our client table above (and assume it was populated with millions of rows). If we wanted to access a client, and we had their SSN, we would issue a query similar to this:

SELECT *
FROM dbo.client
WHERE SSN = '111-622-3033'

Given the client table is physically ordered by SSN, SQL Server can very quickly access the required record(s). Let's assume now that we wanted to find a client based on their date of birth and surname, as per the following query:

SELECT *
FROM dbo.client
WHERE
   DOB = '18 Jun 1974'
   AND surname = 'Smith'

In this case, since the table is ordered by SSN and there are no other indexes in place, SQL Server is forced to scan through the table a row at a time, looking for matching records. On small tables, the performance overhead involved in such a scan wouldn't be noticed. However, as the table grows, so too does the performance hit. To address this, we can create additional nonclustered indexes.

3. Nonclustered indexes

A nonclustered index is created using the create nonclustered index command, as per the following example;

CREATE NONCLUSTERED INDEX ixClientDOBSurname ON dbo.client(DOB, surname)
GO

What we've done here is create an index on the combination^[] of the DOB and Surname columns. In doing so, a separate physical index structure containing, and ordered by, DOB and Surname is created and maintained in line with the table. Each time a row is inserted, updated, or deleted from the client table, the corresponding updates are made to the nonclustered index. When running a query such as the one above that selects data based on the DOB/Surname combination, the index is used, or looked up, with the leaf level of the matching index entries pointing to the appropriate records in the table.

^[] An index created on more than one column is referred to as a composite index.

A good way of understanding the difference between clustered and nonclustered indexes is thinking about a paper-based phone book, ordered by surname. If you're looking for the address of someone whose surname is White, you'd immediately flip to the back of the book and adjust your search from there. In this sense, the phone book can be considered to be clustered on Surname.

On the other hand, if all you had was a phone number and you needed the matching address, your only choice would be to scan every page of the book, an immensely time-consuming process! If there was a section added to the rear of the book containing ordered phone numbers with a corresponding name, you could then flip to the appropriate page to retrieve the address. This extra section of ordered phone numbers can be considered a nonclustered index.

Taking this example further, let's imagine email addresses were added as additional contact information. If all we had was someone's email address, and we wanted their street address (or phone number), we'd be back to scanning through the book from start to finish. As we did with phone numbers, we could build another section at the back of the book containing ordered email addresses with a corresponding name to enable fast lookup to the section of the book containing the contact details. Figure 2 illustrates this concept further.

As per database tables, each time we add indexes to the rear of our phone book, two things happen. First, the size of the book increases, and second, we have additional information to maintain. If anyone changes their name, phone number, or email address, as well as updating the appropriate page within the book, we'd have to update one (or both) of the additional indexes. The additional size and maintenance overhead are important considerations when creating nonclustered indexes.

Figure 2. A hypothetical enhanced paper-based phone book with additional indexes allowing lookups based on phone numbers and email addresses

Before we complete our brief overview of clustered and nonclustered indexes, let's examine their internal structure a little further.

Full-text indexes

In addition to the index types covered in this book, SQL Server includes support for full-text indexes. Unlike the basic partial-text matching offered by the LIKE command, full-text indexes enable optimized and comprehensive searching against large volumes of unstructured data, including linguistic searching based on the installed language set.

4. Index structure

One of the important aspects of nonclustered indexes we've yet to cover is how they link to the records in the base table. The section of the nonclustered index that points to the record in the base table is known as a row locator. If the base table is a heap (no clustered index) the row locator is a pointer to the appropriate table row, identified with a row ID. The row ID consists of the file ID, the page number, and the row number on the page. If a table contains a clustered index, a nonclustered index's row locator is the row's clustered index key.

Indexes within SQL Server (both clustered and nonclustered) are implemented as B-trees. An index seek starts at the top, or root node, of the tree and traverses through intermediate levels before arriving at the leaf nodes. In a clustered index, the leaf nodes contain the actual table data. In a nonclustered index, the leaf nodes contain the row locators. Figure 3 illustrates this structure further.

With SQL Server including the clustered index key value in the row locator of nonclustered indexes, two important design points emerge. First, the width, or size, of the clustered index directly affects the size of each nonclustered index. For example, a single integer-based column is much narrower or smaller than four char(100)-based columns. It follows that from both storage efficiency and lookup performance perspectives, clustered indexes should be as small and as narrow as possible.

Figure 3. A nonclustered index lookup will traverse the B-tree until it reaches the leaf node, at which point the row locator is used to locate the data page in the clustered index.

The other design point to emerge is that the clustered index should be created before the nonclustered indexes. Think about what would happen if a table was created as a heap with nonclustered indexes. As we covered previously, each nonclustered index would use the heap's row ID as its row locator. If a clustered index was then created on the table, each of the nonclustered indexes would need to be updated to include the clustered index key in place of the row ID.

Finally, an important property of a clustered index is whether it's created with a unique constraint. A nonclustered index's row locator must point to a single row, so if the clustered index is not unique, which row will the row locator point to? In addressing this issue, SQL Server will make nonunique clustered indexes unique by adding a hidden uniqueifier value, which distinguishes rows with the same clustered index value.

The use of the row locator to link a nonclustered index row to its corresponding heap or clustered index row is typically referred to as a bookmark lookup; however, in SQL Server 2008, a bookmark lookup operation no longer exists and has been replaced with clustered index seeks, RID lookups, and key lookups.^[] Which of these is used is determined by the presence (or absence) of a clustered index and whether SQL Server chooses to use an index or scan the table. Let's explore this a little further with a look at the key lookup process.

^[] Despite this, the term bookmark lookup is still commonly used to represent the process of locating table data based on an index key value.

5. Key lookup

When SQL Server uses a nonclustered index to fulfill a query request, it uses a key lookup to retrieve the data from the clustered index (and a RID lookup in the case of a heap). When a nonclustered index returns one (or few) rows, this is an efficient and fast operation. However, as the number of matching rows increases, the combined cost of the index seek plus key lookup operations increases to the point where it may be quicker to simply scan the base table. Consider the following two queries:

-- Query A
SELECT * FROM client WHERE SSN = '191-422-3775'

-- Query B
SELECT * FROM client WHERE surname = 'Smith'

In Query A, an index lookup on SSN will return a single row, assuming each client has a unique SSN. In Query B, we're looking for people with a surname of Smith. Assuming the client table has millions of rows, this query could return thousands of matching rows.

Let's assume that the client table has two nonclustered indexes to support each of the above queries—one on SSN and the other on surname—with a clustered index on clientCode. In the first query, the SSN index seek would return one row, with a single key lookup required on the clustered index to return the remaining columns. Contrast this with the second query, which may return thousands of instances of Smith, each of which requires a key lookup. Because the base table is clustered on clientCode, each of the resultant Smith key lookups will be fulfilled from different physical parts of the table, requiring thousands of random I/O operations.

Depending on the number of rows to be returned, it may be much faster to ignore the index and sequentially scan the table's clustered index. Despite reading more rows, the overall cost of a single, large, sequential I/O operation may be less than thousands of individual random I/Os.^[]

^[] It's common for nonclustered indexes requiring key lookups to be ignored in favor of clustered index scans if more than 1 percent or 2 percent (sometimes less) of the table's rows are estimated to match the query. Such is the cost placed on random I/O.

A couple of interesting points emerge from this. The first is the possibility of including additional columns in the nonclustered index to avoid the key lookup. For example, rather than select *, we specify a limited number of query columns, each of which is added to the nonclustered index definition. By doing so, the query request is satisfied from the contents of the nonclustered index alone, avoiding the need for key lookups to the clustered index.

Second, how does SQL Server know how many key lookups will be involved in fulfilling a query request? How does it decide that it's more efficient to do a table/clustered index scan? Enter statistics.

6. Statistics

When indexes are first created, SQL Server calculates and stores statistical information on the column values in the index. When evaluating how a query will be executed, that is, whether to use an index or not, SQL Server uses the statistics to estimate the likely cost of using the index compared to other alternatives.

The selectivity of an index seek is used to define the estimated percentage of matched rows compared to the total number of rows in the table. Broadly speaking, the overall selectivity of an index can be defined as follows:

index selectivity = number of distinct index keys / table row count

The selectivity of an index will range from 0 to 1, or the equivalent value expressed as a percentage. Looking at the extreme ends of the selectivity scale, a unique index represents the best possible selectivity of 1, or 100 percent; every row has a unique (distinct) value. Primary key indexes are perfectly selective. In contrast, an index on a column where every row has the same value (approaching 0 percent selectivity) is obviously of limited/zero value.

By keeping statistics on the distribution of column values in the index, the selectivity of the query can be estimated. As an example, if there are 1020 Smiths and 2 Zatorskys in a surname index, a search on Zatorsky is far more selective than a search on Smith and therefore is more likely to use an index lookup.

An important aspect of statistics is whether they remain up to date. Consider an example where an index starts out being highly selective. A query that performs an equals search on such an indexed column would obviously perform very well. But if a large volume of rows was added to the table with the same indexed column value, the selectivity would lower, potentially dramatically, depending on the volume of rows added and the contents of the indexed column. At this point, the same query, using the same index, may then perform very poorly. To counter situations such as this, SQL Server has a number of settings for automatically updating statistics.

With this background in mind, let's move on to look at the index design process.

Other

USB Hard Drive Hangout (Part 2) - Toshiba Canvio plus 3.0 1.5TB, Adata Dashdrive Elite 500GB HE720

USB Hard Drive Hangout (Part 1) - WD My Passport 2TB

WD 4 TB Black Review – It’s The One To Get

WD My Passport Edge - Fast Mobile Drive For Mac 500GB

Seagate Slim Portable – Thin Mobile Drive

SSDs: Solid And Reliable

Sandisk ReadyCache 32GB - Optimum Your Space To Cache

Silverlight : Data Binding - Providing reasonable defaults for bound data

Silverlight : Data Binding - Controlling Updates

Silverlight : Data Binding - Validating Input for Bound Data