SQL Server 2008 : Index analysis (part 2) - Identifying indexes to add

2/20/2013 8:40:10 PM

2. Identifying indexes to add

The analysis and potential removal of duplicate, unused, or infrequently used indexes can be periodically scheduled as a weekly or monthly maintenance task. In contrast, the addition of indexes is usually the result of a performance-tuning exercise, often started to investigate the sudden or gradual performance decline in one or more queries.

When a query is submitted to SQL Server for execution, the query optimizer determines the ideal index structures for executing the query. If such indexes exist, they're used; otherwise, a suboptimal plan is created with the details of the missing/ideal indexes stored for later reference. Such details can be accessed using one of two methods: the XML query plan and Dynamic Management Views.

Indexes on foreign keys

A common performance problem is related to the absence of foreign key indexes. In almost all cases, an index on a foreign key column will improve performance. For example, in an orders table, there is typically a foreign key to a customers table, with a very common query being something like select * from orders where customerId = 623901. In this example, an index on the customerId foreign key would most likely be beneficial.

XML QUERY PLANS

In its simplest form, the missing-indexes feature can be used for individual queries by inspecting the <MissingIndexes> element of the XML execution plan of a query executed in SQL Server Management Studio. As an example, let's imagine an application change was recently made to support searching AdventureWorks orders by a partial string match on the PurchaseOrderNumber column. Such a query would look like this:

-- Generate a missing index event
SELECT OrderDate, CustomerID, SubTotal, SalesPersonID
FROM Sales.SalesOrderHeader
WHERE PurchaseOrderNumber like 'PO166%'

With no clustered or nonclustered index on the PurchaseOrderNumber column, this query is forced to use a clustered index scan. Let's execute the query after running SET STATISTICS XML ON. The results are shown in figure 2.

Figure 2. Query results with XML statistics

Note the XML that's returned below the result set. Clicking on the XML will open the graphical query execution plan. Right-click the resultant plan and choose Show Execution Plan XML. An example of the resultant XML is shown in figure 3.

The part of the XML plan we're interested in is highlighted. The <MissingIndexes> element contains the details of the ideal index that the query optimizer identified as being missing. In this particular case, the suggested index contains the PurchaseOrderNumber column as an index key with the other columns as included columns. Further, the Impact attribute of the <MissingIndexGroup> element estimates a 91.9624 percent improvement in query performance if this index is in place.

Figure 3. XML statistics with missing indexes highlighted

Using this information, the create statement for the suggested index would be created as follows:

CREATE NONCLUSTERED INDEX ixSalesOrderHeader_PurchaseOrderNumber
ON Sales.SalesOrderHeader(PurchaseOrderNumber)
INCLUDE (OrderDate, CustomerID, SubTotal, SalesPersonID)

Rerunning the query with the above index in place changes the execution plan from a clustered index scan to a nonclustered index seek without the need for key lookups (courtesy of the included columns). Running the query with SET STATISTICS IO ON confirms the reduction of page reads and the estimated 91 percent improvement suggested in the XML we saw earlier.

Now, this does not mean we should run out and immediately add this index. The recommendation for this particular example is made in isolation from the rest of the workload on the server. If this was a one-off ad hoc query, then adding this index would likely result in a net decrease in performance, given its maintenance requirements and the fact that it would be used infrequently. However, if this is a common query, then perhaps it's a candidate for addition.

SET STATISTICS XML ON

In SQL Server versions prior to 2005, query execution plans could be viewed in text or graphical format. Graphical plans are nice, except when they span more than one screen and you need to send them to someone for analysis. On the other hand, text plans are sometimes difficult to read, especially complex plans. XML plans offer the best of both worlds. They can be saved as XML files for distribution, and when viewed in SQL Server Management studio, they're shown graphically with a right-click option for displaying the XML. Further, the XML can be inspected for details on missing indexes and viewing the compiled-versus-runtime parameter values for assistance in diagnosing parameter-sniffing problems.

In essence, XML query plans allow us to access missing index information for a particular query. While this is valuable, what would be really nice would be the ability to look back on previous production workload to identify all of the missing indexes. This is possible using the sys.dm_db_missing_index DMVs.

SYS.DM_DB_MISSING_INDEX DMVS

As we discussed earlier, when the query optimizer uses a suboptimal query plan, it records the details of the missing indexes that it believes are optimal for the query. In addition to viewing these details for a particular query using the XML plans, we can access these details for all queries^[] since the SQL instance last restarted through the sys.dm_db_missing_index DMVs. There are four DMVs in this group:

^[] Up to 500 missing index groups are maintained by these DMVs.

sys.dm_db_missing_index_details—This DMV returns one record for each missing index and contains columns for suggested key and included columns. Each missing index is identified with an index_handle.
sys.dm_db_missing_index_groups—This DMV acts as a mapping between sys.dm_db_missing_index_details and sys.dm_db_missing_index_group_stats. It contains two columns: index_handle and group_handle.
sys.dm_db_missing_index_group_stats—Each time the same index is identified by the query optimizer as missing, its potential value grows. This DMV stores the accumulated statistics on the index's potential value. Identified with group_handle, the accumulated stats can be traced back to the details of the missing index via the sys.dm_db_missing_index_groups DMV.
sys.dm_db_missing_index_columns—This dynamic management function takes an index_handle as a parameter and returns the missing index's columns in a table format with a column indicating the suggested index column type.

To understand how each of these DMVs operates, let's use our example query from earlier:

SELECT OrderDate, CustomerID, SubTotal, SalesPersonID
FROM Sales.SalesOrderHeader
WHERE PurchaseOrderNumber like 'PO166%'

After restarting our SQL instance to clear the contents of the DMVs, we'll run this query, along with inspecting the results of the four DMVs. There are a couple of points to note about what we've done here. First, like the sys.dm_db_index_usage_stats DMV, these DMVs are cleared with each instance restart, so any decision based on their content should be made after the instance has been up and running for a period of time that covers the full range of activity. Second, the value of the sys.dm_db_missing_ index_group_stats DMV is that it accumulates statistics on the potential value of a missing index after repeat misses. In our case, we've missed the index only once since the instance was started, so the figures shown for this DMV are for a single miss. So with these points in mind, let's examine the output from the DMVs, shown in figure 4.

First, you'll note that the equality column for sys.dm_db_missing_index_details is empty. This is because there are no equality predicates in our query, for example, where x=10. We used like, so this appears as an inequality column. Second, note that sys.dm_db_missing_index_columns is a function that takes an index_handle as an input and returns index columns in a tabular fashion along with the column usage (equality, inequality, include). In contrast, sys.dm_db_missing_index_details returns one row per missing index.

The real value of these DMVs is when the instance has been up and running for a period of time. This allows us to take advantage of the accumulated statistics from the sys.dm_db_missing_index_group_stats DMV.

avg_total_user_cost—This column represents the reduction in query cost if the suggested index was present.

Figure 4. Missing index information returned from various DMVs
avg_user_impact—This represents the estimated percentage improvement of the query with the index present.
user_seeks and user_scans—Each time the query is executed without the suggested index, these values increment to represent the number of times the index could have been used in either a seek or scan capacity.

When selecting from the sys.dm_db_missing_index_group_stats DMV, we can order by the above columns to show the indexes with the greatest potential for improvement. To simulate this, we can rerun our SalesOrderHeader query multiple times and then query the DMV ordering by the above columns. Listing 1 shows the code for this. Note the GO 10 command after the SalesOrderHeader query, used to execute the query 10 times. In addition to this query, we execute another query on the DatabaseLog table, which also generates a missing index event. This allows us to view multiple missing indexes ordered by potential value.

Example 1. Simulating activity to view missing indexes by priority

-- Simulate a weighted query workload by using GO 10 for 1 statement
SELECT OrderDate, CustomerID, SubTotal, SalesPersonID
FROM Sales.SalesOrderHeader
WHERE PurchaseOrderNumber like 'PO166%'
GO 10

SELECT DatabaseUser, TSQL
FROM dbo.DatabaseLog
WHERE Event = 'ALTER_TABLE'
GO

Listing 2 shows the code to view the missing index information generated by the above two queries, weighted by potential value.

Example 2. Missing index DMV query with potential weighting

-- View missing indexes Weighted by potential value
SELECT
     sc.name + '.' + OBJECT_NAME(details.object_id)
     , details.equality_columns
     , details.inequality_columns
     , details.included_columns
     , stats.avg_total_user_cost
     , stats.user_seeks
     , stats.avg_user_impact
     , stats.avg_total_user_cost * stats.avg_user_impact
      * (stats.user_seeks + stats.user_scans) as potential
FROM sys.dm_db_missing_index_group_stats stats
     INNER JOIN sys.dm_db_missing_index_groups groups
       ON stats.group_handle = groups.index_group_handle
     INNER JOIN sys.dm_db_missing_index_details details
       ON details.index_handle = groups.index_handle
    INNER JOIN sys.objects o
       ON o.object_id = details.object_id
    INNER JOIN sys.schemas sc
      ON o.schema_id = sc.schema_id
ORDER BY potential desc

The calculated potential column takes into account the columns from the sys.dm_db_missing_index_group_stats that we covered earlier. By executing the query on the SalesOrderHeader table 10 times (using the GO 10 command), the potential value of the missing index on this table is increased by virtue of the user_seeks column, together with the avg_total_user_cost and avg_user_impact columns. The output from the query in listing 6 is shown in figure 5.

The potential value for the missing indexes is calculated by multiplying the sum of seeks/scans against the user cost and impact columns. Because the SalesOrderHeaderquery was executed 11 times, its potential weighting is greater compared to a single execution of the query on the DatabaseLog table.

Figure 5. Missing index DMV query with weighting potential

The full value of the missing index DMVs can be exploited as part of a regular maintenance routine. For example, weekly checks on index usage to remove duplicate/unused indexes could be combined with checks on the potential for missing indexes using a query similar to the previous one.

Despite the power of the missing index DMVs, there are some limitations. The major ones are as follows:

A maximum of 500 missing indexes will be kept.
The DMVs will be cleared when the SQL instance is restarted or the table definition changes.
While the missing indexes report the columns to include (key and included columns), they don't suggest a column order for the index.
Certain index types, including filtered indexes, clustered indexes, and indexed views, are not suggested.

The Database Engine Tuning Advisor, covered next, addresses some of these shortcomings.

Database Engine Tuning Advisor

The Database Engine Tuning Advisor, accessible in the Performance Tools folder of the Microsoft SQL Server 2008 program group, analyzes workload from either a T-SQL file or SQL Trace file/table. T-SQL files are typically used to analyze a limited set of commands, whereas SQL traces, are used to analyze workload over a period of time, for example, a 24-hour period covering typical production activity.

Once the workload input has been specified, options can be configured to limit the scope of recommendations that the tuning advisor will consider. Figure 6 shows the broad range of options available for consideration.

Figure 6. The Database Engine Tuning Advisor can be used to examine a workload and suggest various index types to improve performance.

As shown in figure 6, the analysis options are far greater than what can be achieved using the missing index DMVs that we covered earlier. Further, the analysis can be based on a wide range of production activity captured in a trace file (or table) for future analysis.

Once the workload input has been selected and a target database chosen, analysis can begin. When the analysis is complete, recommendations can be viewed, an example of which is shown in figure 7

One of the clear advantages of using the tuning advisor over other index analysis methods is the ease with which the recommendations can be turned into actual T-SQL code for implementation. As shown in figure 7, you can click the Definition column to display the appropriate T-SQL code to effect the recommendation. Further, the tool can be used for recommendations on indexes to drop as well as those to add.

As per the recommendations from the missing index DMVs, you should carefully consider indexes suggested by the tuning advisor before implementing them. Among others, considerations include these:

Is the workload sufficient to cover the full range of production activity?
Can the suggested recommendations be load/volume tested before production implementation?