In this section we look at new features in SQL Server 2008 for more efficiently storing columns with a large number of NULL values as well as for compressing the data in a table.
Sparse Columns
Sparse columns make it possible to store columns more efficiently where a large percentage of the rows will be NULL.
Often, we need to include columns in our tables that only occasionally
have values supplied for them. A column that stores an optional
middle initial is an example. Most people don't bother to include their
middle initial when entering their names, but occasionally somebody
will. The large percentage of rows without a middle initial will then be NULL. By declaring the middle initial column with the SPARSE keyword, SQL Server consumes no physical storage for that column in any row where its value is NULL.
There is a trade-off, of
course. The columns that do contain values actually take more storage
than nonsparse columns of the same data type. This means that there is a
break-even point we must cross in terms of the percentage of rows that
must contain NULL
in that column for any actual space savings to occur. For example, the
SQL Server documentation states that a nonsparse integer column that
contains a value takes four bytes of storage, whereas an integer column
marked as sparse requires eight bytes to store the same value. That is
double the cost for each non-null integer value stored. By that math,
more than 50 percent of the rows must be NULL before any space is saved. The SQL Server documentation actually states that 64 percent of the rows have to be NULL
to realize a savings once overhead is accounted for. When a column is
that sparsely populated, however, the savings can be worth it. All data
types can be sparse with the exception of geography, geometry, image,
ntext, text, timestamp, and user-defined data types.
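For example, a middle-initial column like the one described above could be declared with the SPARSE keyword as follows (the table and column names here are illustrative, not from AdventureWorks):
```sql
CREATE TABLE dbo.People
(
PersonID int NOT NULL PRIMARY KEY,
FirstName nvarchar(50) NOT NULL,
-- Mostly NULL, so SPARSE avoids storing anything for the NULL rows
MiddleInitial nchar(1) SPARSE NULL,
LastName nvarchar(50) NOT NULL
);
```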
If a table has more than one
sparse column, we can work with all of the sparse columns using a Column
Set. A Column Set is an XML column that allows us to work with all the
sparse columns in the same table as a single entity. If there are a
large number of sparse columns, it may be more convenient or efficient
to treat them as a single set of data rather than as individual columns.
For a large number of rows, working with the values as a Column Set
could provide a performance increase as well. We include a Column Set by
adding an XML column to the table’s definition and then marking that
column with COLUMN_SET FOR ALL_SPARSE_COLUMNS.
We can still address the sparse columns directly by name, but SQL
Server defaults to returning only the Column Set column, for example, when
we execute a SELECT *.
Sparse columns are
good candidates for inclusion in filtered indexes. A filtered index is a
special index that includes only rows that match the criteria specified
in a WHERE clause that is part of the CREATE INDEX statement. Since sparse columns by definition should contain mostly NULL values, filtered indexes on the non-null values in sparse columns would only have to index a small percentage of the rows.
The
following sample code creates a table to store demographic details we
hope to collect from a customer survey. It is likely that the
individuals completing the survey will leave a large number of fields
empty. We want to optimize the storage of those empty (NULL) values by using sparse columns. Notice that the MaritalStatus column has not been marked sparse. The documentation states that 98 percent of the rows would have to be NULL for us to realize a space savings on the bit data type by marking it SPARSE. More than 2 percent of the surveys will likely be received with the MaritalStatus column completed, so we won’t mark it as SPARSE.
We will include a Column Set definition to allow us to work with all
the demographic values as a single set of XML data, and finally we will
create a filtered index on the income field to help us quickly find
customers based on their income:
USE AdventureWorks2008;
GO
CREATE TABLE Person.Demographics
(
BusinessEntityID int NOT NULL PRIMARY KEY,
Age int SPARSE,
BirthDate date SPARSE,
Income int SPARSE,
Cars int SPARSE,
Children int SPARSE,
MaritalStatus bit NULL,
AllFields xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);
CREATE NONCLUSTERED INDEX FIDemographicsIncomes
ON Person.Demographics(Income)
WHERE Income IS NOT NULL;
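Once the table exists, we can work with the sparse columns either by name or through the Column Set. The following sketch shows both styles against the table just created (the sample values are illustrative):
```sql
-- Insert by naming the individual sparse columns...
INSERT INTO Person.Demographics (BusinessEntityID, Age, Income)
VALUES (1, 34, 50000);

-- ...or supply the same kind of data through the Column Set as XML.
INSERT INTO Person.Demographics (BusinessEntityID, AllFields)
VALUES (2, '<Age>42</Age><Income>60000</Income>');

-- SELECT * returns BusinessEntityID, MaritalStatus, and the AllFields
-- XML; the sparse columns appear only when we name them explicitly.
SELECT BusinessEntityID, Age, Income
FROM Person.Demographics
WHERE Income IS NOT NULL;
```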
The main advantage of
using sparse columns is space savings. Smaller rows mean more rows per
page, fewer pages per table, less I/O per query, less buffer cache per
query, and so on. SQL Server 2008 now also includes the ability for us
to compress the rows and pages of our tables explicitly. The next
section talks about just that.
Compression
SQL Server 2008 is
the first version of SQL Server that supports internal data compression.
We have been able to use NTFS compression in previous versions to
compress read-only
filegroups and partitions. With SQL Server 2008, we can have the data
engine itself perform compression at either the row or page level, and
we can
define those compression settings on a partition-by-partition basis. It
should be noted that SQL Server 2008 also supports backup compression,
but that is beyond the scope of this book. In this section we will talk
specifically about Row and Page compression in our tables and indexes.
As with sparse columns, the
big benefit of compression is a savings in storage costs. If we can
compress our rows and pages, we can fit more rows per page and reduce
the amount of space it takes to store the same amount of data. That
storage cost savings can then translate into reduced performance costs
because it requires less IO to read those pages. Of course, we don’t get
it for free. Compression will increase CPU costs as we compress and
decompress the data. So as it is with many other things in SQL Server,
there are times to compress and times not to compress.
A number of
compression algorithms are available today, but two common techniques
are to avoid storing empty bytes and to store repeating patterns of
data only a single time. SQL Server applies both of these
methods through its Row and Page compression options.
Row compression takes the
approach of not storing “empty” bytes. It effectively treats all column
values as if they were variable width. This can save a lot of space if
you have columns that aren’t completely full. It doesn’t take an
excessive amount of CPU usage to effect Row compression, so if it gives
you enough of a storage savings, you won’t need Page compression.
Page compression
implies Row compression, but it also looks for and reduces the
occurrence of redundant data. A lot more work is involved in Page
compression, and so the CPU utilization will be significantly higher
than that of just Row compression, but with enough redundant bytes of
data, it could yield much higher space savings.
Deciding whether to use Row
or Page compression comes down to a question of the amount of space
saved and the amount of CPU cost to get that savings. If Row compression
provides acceptable savings in space, then stay with that because it
will be cheaper than Page compression. If Row compression does not offer
enough space savings, however, we can then step up to Page compression.
Warning
The Row and Page
compression features are available only in the Enterprise, Evaluation,
and Developer editions. Watch out for questions that try to trick you
into the wrong answer based on the edition.
Row Compression
Row
compression effectively asks that SQL Server store all fields as if
they were variable length. By effectively treating every data type as if
it were variable length, it is possible that we can realize an overall
reduction in storage. SQL Server 2005 SP2 introduced a table option
called vardecimal.
By setting this option on a table, you could effectively store all
fields with the decimal and numeric data types as if they were variable
length. The new Row compression option supersedes the vardecimal behavior and extends the variable-width storage to all data types.
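For comparison, the older SQL Server 2005 SP2 mechanism was enabled per table with sp_tableoption, whereas Row compression is declared on the table itself (the table name here is illustrative):
```sql
-- SQL Server 2005 SP2: variable-length storage for decimal/numeric only
EXEC sp_tableoption 'dbo.Sales', 'vardecimal storage format', 1;

-- SQL Server 2008: Row compression extends this idea to all data types
ALTER TABLE dbo.Sales REBUILD WITH (DATA_COMPRESSION = ROW);
```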
For example, imagine a table with a char(10)
field for storing postal codes. We need all ten characters for rows
where the postal code uses the "zip + 4" format (00000-0000). For the
rows that have only five characters in their postal code, however, the
extra five characters are wasted space. By storing the characters as a
variable-width field, SQL Server could save five bytes of storage for
the shorter postal codes. The variable-width data, however, has some
overhead associated with it because we need a map, stored as a series of
offsets, of where the data for each field is. That means that if the
fields don’t really vary in length (they all use “zip + 4”), it is less
efficient to store them as variable-width data than as fixed width.
We indicate that a table or an index should be compressed via the respective CREATE and ALTER statements. The following example creates a table using Row compression:
CREATE TABLE Student
(
SomeID int IDENTITY(1,1) PRIMARY KEY,
FirstName nvarchar(50) NOT NULL,
LastName nvarchar(50) NOT NULL,
Age int NOT NULL,
ModifiedDate datetime NOT NULL
) WITH (DATA_COMPRESSION=ROW);
In this example, the storage of the FirstName and LastName fields does not change much because they are already variable-width nvarchar columns, but the int and datetime fields can now also be stored as variable width because the Row compression option was enabled.
Page Compression
Row compression
does not store empty bytes on the page. Page compression goes further:
it identifies and reduces the occurrences of repeating patterns in
the column values of the rows on the page. Page compression implies Row
compression, and then adds the column Prefix and Page Dictionary
compression methods.
In
column Prefix compression, SQL Server identifies the single most common
set of bytes at the beginning (reading from left to right) of the column
values for all rows of a given column. It then stores that common set of
bytes for each column in a "Compression Information" record on the page.
The column values in the rows can then reference the single occurrence of
the common byte sequence rather than storing it repeatedly in each row
where it occurs. Rows that begin with all or even some of the common
bytes are compressed according to the number of bytes they share with
the common prefix. Rows that begin with none of the common
bytes are not compressed at all.
Since column Prefix compression is
done on a single column at a time, it can only compress data when
multiple values in a column start with the same bytes. For example, if a
table had a single column of first name values and only ten rows, and
if each first name started with a different character, Prefix compression
would have no benefit because there would be no common byte pattern at
the beginning of the column values. With a larger number of rows, and an
initial sequence of characters shared across a large number of rows,
Prefix compression could have a significant impact on the total
storage. To find repeating patterns of data across all rows and columns,
SQL Server performs Page Dictionary compression.
In Page Dictionary compression, SQL Server creates a dictionary (or list) of all common byte patterns across rows and
columns on the page. It is no longer looking for only common prefixes
on a column-by-column basis. Instead it is looking for all byte patterns
that are repeated in any column of any row. It stores this list of
common patterns in the same “Compression Information” structure in which
it keeps the column prefix values. It then replaces any occurrence of
those common byte patterns in the columns of the rows on the page with a
reference to the pattern's position in the dictionary.
The following example alters the table we made in the Row compression example and changes it to Page compression:
ALTER TABLE Student
REBUILD WITH (DATA_COMPRESSION=PAGE);
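Indexes are compressed with the same option through their own CREATE and ALTER statements. For example, the filtered index created earlier in this section could be rebuilt with Row compression (a sketch, assuming that index exists):
```sql
-- Rebuild the filtered index from the sparse-columns example
-- with Row compression enabled
ALTER INDEX FIDemographicsIncomes
ON Person.Demographics
REBUILD WITH (DATA_COMPRESSION = ROW);
```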
Individual index
and table partitions can have different compression settings. We can opt
on a partition-by-partition basis to have either no compression, Row
compression, or Page compression. If we have a table partitioned by
time, for example, we could leave the partition with the current month's
data uncompressed, use Row compression on the partition with the
previous month's data, and finally use Page compression on the
partitions with all older data. The following example shows the syntax
for compressing a table with multiple partitions:
CREATE TABLE TimePartitioned
(SomeKey int, SomeValue nvarchar(50), TheDate date)
ON TimeScheme(TheDate)
WITH
(
DATA_COMPRESSION = NONE ON PARTITIONS (1),
DATA_COMPRESSION = ROW ON PARTITIONS (2),
DATA_COMPRESSION = PAGE ON PARTITIONS (3)
);
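A single partition can later be rebuilt with a different setting, which is how data would be "aged" from one compression level to the next (a sketch against the table above):
```sql
-- Move partition 2 from Row to Page compression as its data ages
ALTER TABLE TimePartitioned
REBUILD PARTITION = 2
WITH (DATA_COMPRESSION = PAGE);
```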
Turning
compression on will have some implied costs in terms of both the
ongoing costs to maintain compression and the initial costs to compress
data already in the table or index. We should therefore first verify
that we will get a benefit from compression before we enable
compression.
Estimating Compression Savings
We can estimate how much
space compression should save us before we actually enable it. SQL
Server provides a system stored procedure named sp_estimate_data_compression_savings.
The procedure takes a sampling of the actual pages in the target table,
or index(es), performs compression on those pages, and measures the
space saved. It then estimates what the compression would be for all
pages based on that measurement. The following example uses the
procedure to estimate how much space savings we should expect if we were
to enable Page compression on all partitions (@partition_number=NULL) of the clustered index (@index_id=1) on the AdventureWorks2008.Production.Product table:
EXEC sp_estimate_data_compression_savings
@schema_name='Production',
@object_name='Product',
@index_id=1,
@partition_number=NULL,
@data_compression=PAGE;
This statement returns output similar to the following (edited) result:
object_name current_size(KB) compressed_size(KB)
----------- ---------------- -------------------
Product 120 72
Exercise . Estimating the Effect of Compression
Open a new query in SQL Server Management Studio and select AdventureWorks2008 from the list of available databases. Run the sample EXEC sp_estimate_data_compression_savings statement shown above. Try it on different tables. Try changing the @data_compression from PAGE to ROW.
The output shows that we
would go from 120KB to 72KB (60 percent of the original size) of storage
for the Product table. That sounds pretty good and gives us the
confidence to enable compression. If, however, the compression estimates
were bleaker, we would probably opt to leave compression off.