1. Index statistics
The default statistics configuration enables SQL
Server to automatically create and update statistics on all indexed
columns. This setting reduces maintenance requirements and ensures SQL
Server makes accurate decisions when creating query execution plans.
In cases where the index has multiple columns,
statistics are also kept for the combination of column prefixes, for
example, an index on LastName, FirstName, DOB will keep statistics on
(LastName), (LastName + FirstName), and (LastName + FirstName + DOB).
This enables the most accurate estimate to be made on the index's
selectivity when evaluating a query containing predicates on one or more
of these columns.
In a case where LastName and DOB are specified
(but not FirstName), the selectivity would be based on LastName alone.
For this reason, if the most selective column is always supplied as a
search predicate, and other index columns are not, then it makes sense
for the most selective column to be placed first in the index column
order.
When an index is rebuilt using one of the various options (REBUILD, CREATE WITH DROP EXISTING, or DROP/CREATE), the statistics are updated in full. A common mistake made when creating maintenance plans is to run the UPDATE STATISTICS command after an index rebuild. When run with the default parameters, UPDATE STATISTICS
creates statistics based on a sample of rows. When run in this manner,
the full statistics from the index rebuild are replaced with
less-accurate statistics.
|
In almost all cases, the default database
settings for statistics (Auto Create Statistics and Auto Update
Statistics) should remain in place. These settings, accessible via the
Database Properties page as shown in figure 1, ensure that SQL Server automatically manages both the creation and update of statistics information.
When the query optimizer compiles a query plan
requiring statistics, the statistics are automatically updated if they
are detected as being out of date. Statistics are deemed as being out of
date for a variety of reasons, primarily when a certain percentage of
rows have been modified since the statistics were last updated. In such
an event, the statistics are updated in line;
that is, the query compilation pauses, waiting for the statistics
update to complete. Such events, which can be confirmed using a SQL
Profiler trace containing the Auto Stats event, may lead to
occasional, unpredictable query performance, a situation that sometimes
leads to DBAs switching off the automatic statistic options in a bid for
more predictable query performance.
While automatic statistic events may result in
an occasional query execution delay, the benefit of automatic statistics
maintenance must be considered, particularly its advantage of avoiding
poor query plans based on old statistics. The alternative of
hand-crafting customized statistics is simply far too time consuming and
inaccurate for all but the most specialized of cases.
Fortunately, a compromise setting was introduced
in SQL Server 2005. The Auto Update Statistics Asynchronously option,
set using the database property shown earlier in figure 1, will trigger automatic statistic update events asynchronously;
that is, a query compilation will not wait on the new statistics and
will proceed with the old, with subsequent queries benefiting from the
newly updated statistics. Using this setting is a trade-off between the
most accurate query plan and predictable query performance.
In addition to creating and maintaining indexed
column statistics, SQL Server creates and maintains statistics on
nonindexed columns. Such statistics are called column statistics.
2. Column statistics
Along with index statistics, you can view column statistics in SQL Server Management Studio, as shown in figure 2. Column statistics are named with a _WA_Sys prefix.
So why and how are column statistics useful? We
know that unless an appropriate index exists on a query's predicate
column(s), the statistics are not used for evaluating the usage of an
index, because there is no index. The
answer lies in the query optimizer needing an accurate indication of the
likely number of rows that will be returned from a query in order to
select the best query plan, in particular, the appropriate join logic.
For example, consider this query:
-- Statistics on large + red products may help to optimize the join ...
SELECT orders.*
FROM dbo.product
INNER JOIN dbo.orders on product.productId = orders.productId
WHERE
product.color = 'red'
AND product.size = 'large'
Assuming there are no indexes on the color
and/or size columns, what good is it to maintain statistics on them? How
many rows SQL Server thinks will be
returned after applying the filter condition on the products table
determines the type of join operation that will be used, that is, hash
join, nested loop, or merge.
For nonindexed columns, statistics are
maintained only for singular column values. In the previous example,
statistics would be automatically created and maintained on the color
column but not the combination of color and size. Consider a case where
the table contained a million records with red products but only two
with large red products. The quality of the compiled execution plan
depends on whether the query optimizer has the required statistics.
Depending on the selectivity of color, compared to color and size, it
may be worthwhile to manually create multicolumn statistics, as per the
following example:
-- Create custom column statistics
CREATE STATISTICS Product_Color_Size
ON dbo.product (color, size)
WITH FULLSCAN
In addition to the CREATE STATISTICS command shown above, SQL Server provides a number of other commands enabling manual control over statistics.
3. Manually creating/updating statistics
Assuming the automatic statistics options are in
place, there are a few cases that call for manual intervention. As
we've just discussed, creating multicolumn statistics on combinations of
certain nonindexed columns is one such case. Another may be where the
default sampling frequency is inadequate and is producing inaccurate
plans.
DROP STATISTICS—This
command enables a specified statistics set to be dropped, whether
created manually or automatically. Index statistics, however, cannot be
dropped.
CREATE STATISTICS—This
command can be used to create statistics for a supplied column or set
of columns. If you're using the default automatic statistics settings,
this command is typically used only to create multicolumn statistics, an
example of which you saw earlier. New to SQL Server 2008 is the ability
to create filtered statistics, which, similar to filtered indexes,
maintain statistics for a subset of data.
sp_createstats—This
command creates single-column statistics for all eligible columns in
all user tables in the database in which the command in executed.
UPDATE STATISTICS—Updates
statistics for a given statistics name. Again, if the automatic
statistics settings are in place, this command is typically used only
when statistics are suspected of being out of date. Running this command
enables a more frequent refresh compared to the automatic default.
sp_updatestats—This command runs UPDATE STATISTICS against all user tables using the ALL keyword, therefore updating all statistics maintained for each user table.
With the exception of DROP STATISTICS,
all of the above commands take an optional parameter for the sampling
rate. Without specifying a value, SQL Server estimates the appropriate
number of rows to inspect, with the goal of striking a balance between
useful statistics and low impact. In all cases, the FULLSCAN (or 100 percent) option can be used to sample every row, thereby achieving maximum accuracy.
As we covered earlier, rebuilding an index will
update the statistics with the equivalent of a manual full scan. A
common reason for performing a full index rebuild is to ensure the
statistics are kept at their most accurate. However, as we also
discussed, unnecessary index rebuilds create a lot of transaction log
data, in turn causing potential issues with log shipping and mirroring
solutions. If indexes are being rebuilt for the express purpose of
maintaining statistics, that is, fragmentation levels are not of
concern, then manually running UPDATE STATISTICS using the FULLSCAN option is perhaps a better choice; statistics will be updated in full, without the overhead of the index rebuild.
4. Inspecting statistics
The DBCC SHOW_STATISTICS command can be used to view the statistics information for a given index, as per the example shown in figure 3.
DBCC SHOW_STATISTICS provides a great
deal of information useful in inspecting the statistics for a given
index. The output is grouped into three sections, referred to as
STAT_HEADER, DENSITY_VECTOR, and HISTOGRAM:
STAT_HEADER—Contains
information including the date and time of the last stats update,
number of rows sampled versus table rows, and any filtering conditions.
DENSITY_VECTOR—Contains
the length and selectivity of each column prefix. As discussed
previously, stats are kept for all depth levels of the index.
HISTOGRAM—
The histogram is the most descriptive section of output, containing the
actual sampled values and associated statistics such as the number of
records that match the sampled value.