In this section we look at new features in SQL Server 2008 for more efficiently storing columns with a large number of NULL values as well as for compressing the data in a table.
Sparse Columns
Sparse columns make it possible to store columns more efficiently where a large percentage of the rows will be NULL.
Often, we need to include columns in our tables that only occasionally
have values supplied for them. A column that stores an optional
middle initial is an example. Most people don't bother to include their
middle initial when entering their names, but occasionally somebody
will. The large percentage of rows without a middle initial will then be NULL. By declaring the middle initial column with the SPARSE keyword, SQL Server consumes no physical storage for that column in any row where its value is NULL.
There is a trade-off, of
course. The columns that do contain values actually take more storage
than nonsparse columns of the same data type. This means that there is a
break-even point we must cross in terms of the percentage of rows that
must contain NULL
in that column for any actual space savings to occur. For example, the
SQL Server documentation states that a nonsparse integer column that
contains a value takes four bytes of storage, whereas an integer column
marked as sparse requires eight bytes to store the same value. That is
double the cost for each non-null integer value stored. By that math,
more than 50 percent of the rows must be NULL before any space is saved. The SQL Server documentation actually states that 64 percent of the rows have to be NULL
to realize a savings once overhead is accounted for. When a column is
that sparsely populated, however, the savings can be worth it. All data
types can be sparse with the exception of geography, geometry, image,
ntext, text, timestamp, and user-defined data types.
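For example, a middle-initial column like the one described above could be declared with the SPARSE keyword as follows (the table and column names here are illustrative, not from AdventureWorks):
```sql
CREATE TABLE dbo.People
(
PersonID int NOT NULL PRIMARY KEY,
FirstName nvarchar(50) NOT NULL,
-- Mostly NULL, so SPARSE avoids storing anything for the NULL rows
MiddleInitial nchar(1) SPARSE NULL,
LastName nvarchar(50) NOT NULL
);
```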
If a table has more than one
sparse column, we can work with all of the sparse columns using a Column
Set. A Column Set is an XML column that allows us to work with all the
sparse columns in the same table as a single entity. If there are a
large number of sparse columns, it may be more convenient or efficient
to treat them as a single set of data rather than as individual columns.
For a large number of rows, working with the values as a Column Set
could provide a performance increase as well. We include a Column Set by
adding an XML column to the table’s definition and then marking that
column with COLUMN_SET FOR ALL_SPARSE_COLUMNS.
We can still address the sparse columns directly by name, but SQL
Server defaults to returning only the Column Set column, for example, when
we execute a SELECT *.
Sparse columns are
good candidates for inclusion in filtered indexes. A filtered index is a
special index that includes only rows that match the criteria specified
in a WHERE clause that is part of the CREATE INDEX statement. Since sparse columns by definition should contain mostly NULL values, filtered indexes on the non-null values in sparse columns would only have to index a small percentage of the rows.
The
following sample code creates a table to store demographic details we
hope to collect from a customer survey. It is likely that the
individuals completing the survey will leave a large number of fields
empty. We want to optimize the storage of those empty (NULL) values by using sparse columns. Notice that the MaritalStatus column has not been marked sparse. The documentation states that 98 percent of the rows would have to be NULL for us to realize a space savings on the bit data type by marking it SPARSE. More than 2 percent of the surveys will likely be received with the MaritalStatus column completed, so we won’t mark it as SPARSE.
We will include a Column Set definition to allow us to work with all
the demographic values as a single set of XML data, and finally we will
create a filtered index on the income field to help us quickly find
customers based on their income:
USE AdventureWorks2008;
GO
CREATE TABLE Person.Demographics
(
BusinessEntityID int NOT NULL PRIMARY KEY,
Age int SPARSE,
BirthDate date SPARSE,
Income int SPARSE,
Cars int SPARSE,
Children int SPARSE,
MaritalStatus bit NULL,
AllFields xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);
CREATE NONCLUSTERED INDEX FIDemographicsIncomes
ON Person.Demographics(Income)
WHERE Income IS NOT NULL;
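Once the table exists, we can work with the sparse columns either by name or through the Column Set. The following sketch shows both styles against the table just created (the sample values are illustrative):
```sql
-- Insert by naming the individual sparse columns...
INSERT INTO Person.Demographics (BusinessEntityID, Age, Income)
VALUES (1, 34, 50000);

-- ...or supply the same kind of data through the Column Set as XML.
INSERT INTO Person.Demographics (BusinessEntityID, AllFields)
VALUES (2, '<Age>42</Age><Income>60000</Income>');

-- SELECT * returns BusinessEntityID, MaritalStatus, and the AllFields
-- XML; the sparse columns appear only when we name them explicitly.
SELECT BusinessEntityID, Age, Income
FROM Person.Demographics
WHERE Income IS NOT NULL;
```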
The main advantage of
using sparse columns is space savings. Smaller rows mean more rows per
page, fewer pages per table, less I/O per query, less buffer cache per
query, and so on. SQL Server 2008 now also includes the ability for us
to compress the rows and pages of our tables explicitly. The next
section talks about just that.
Compression
SQL Server 2008 is
the first version of SQL Server that supports internal data compression.
We have been able to use NTFS compression in previous versions to
compress read-only
filegroups and partitions. With SQL Server 2008, we can have the data
engine itself perform compression at either the row or page level, and
we can
define those compression settings on a partition-by-partition basis. It
should be noted that SQL Server 2008 also supports backup compression,
but that is beyond the scope of this book. In this section we will talk
specifically about Row and Page compression in our tables and indexes.
As with sparse columns, the
big benefit of compression is a savings in storage costs. If we can
compress our rows and pages, we can fit more rows per page and reduce
the amount of space it takes to store the same amount of data. That
storage cost savings can then translate into reduced performance costs
because it requires less IO to read those pages. Of course, we don’t get
it for free. Compression will increase CPU costs as we compress and
decompress the data. So as it is with many other things in SQL Server,
there are times to compress and times not to compress.
A number of
compression algorithms are available today, but two common techniques
are to avoid storing empty bytes and to store repeating patterns of
data only a single time. SQL Server applies both of these
methods through its Row and Page compression options.
Row compression takes the
approach of not storing “empty” bytes. It effectively treats all column
values as if they were variable width. This can save a lot of space if
you have columns that aren’t completely full. It doesn’t take an
excessive amount of CPU usage to effect Row compression, so if it gives
you enough of a storage savings, you won’t need Page compression.
Page compression
implies Row compression, but it also looks for and reduces the
occurrence of redundant data. A lot more work is involved in Page
compression, and so the CPU utilization will be significantly higher
than that of just Row compression, but with enough redundant bytes of
data, it could yield much higher space savings.
Deciding whether to use Row
or Page compression comes down to a question of the amount of space
saved and the amount of CPU cost to get that savings. If Row compression
provides acceptable savings in space, then stay with that because it
will be cheaper than Page compression. If Row compression does not offer
enough space savings, however, we can then step up to Page compression.
Warning
The Row and Page
compression features are available only in the Enterprise, Evaluation,
and Developer editions. Watch out for questions that try to trick you
into the wrong answer based on the edition.
Row Compression
Row
compression effectively asks that SQL Server store all fields as if
they were variable length. By effectively treating every data type as if
it were variable length, it is possible that we can realize an overall
reduction in storage. SQL Server 2005 SP2 introduced a table option
called vardecimal.
By setting this option on a table, you could effectively store all
fields with the decimal and numeric data types as if they were variable
length. The new Row compression option supersedes the vardecimal behavior and extends the variable-width storage to all data types.
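For comparison, the older SQL Server 2005 SP2 mechanism was enabled per table with sp_tableoption, whereas Row compression is declared on the table itself (the table name here is illustrative):
```sql
-- SQL Server 2005 SP2: variable-length storage for decimal/numeric only
EXEC sp_tableoption 'dbo.Sales', 'vardecimal storage format', 1;

-- SQL Server 2008: Row compression extends this idea to all data types
ALTER TABLE dbo.Sales REBUILD WITH (DATA_COMPRESSION = ROW);
```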
For example, imagine a table with a char(10)
field for storing postal codes. We need all ten characters for rows
where the postal code uses the "zip + 4" format (00000-0000). For the
rows that have only five characters in their postal code, however, the
extra five characters are wasted space. By storing the characters as a
variable-width field, SQL Server could save five bytes of storage for
the shorter postal codes. The variable-width data, however, has some
overhead associated with it because we need a map, stored as a series of
offsets, of where the data for each field is. That means that if the
fields don’t really vary in length (they all use “zip + 4”), it is less
efficient to store them as variable-width data than as fixed width.
We indicate that a table or an index should be compressed via the respective CREATE and ALTER statements. The following example creates a table using Row compression:
CREATE TABLE Student
(
SomeID int IDENTITY(1,1) PRIMARY KEY,
FirstName nvarchar(50) NOT NULL,
LastName nvarchar(50) NOT NULL,
Age int NOT NULL,
ModifiedDate datetime NOT NULL
) WITH (DATA_COMPRESSION=ROW);
In this example, the storage of the FirstName and LastName fields does not change much because they are already variable-width nvarchar columns, but the int and datetime fields can now also be stored as variable width because the Row compression option was enabled.
Page Compression
Row compression
does not store empty bytes on the page. Page compression goes further:
it identifies and reduces the occurrences of repeating patterns in
the column values of the rows on the page. Page compression implies Row
compression, and then adds the column Prefix and Page Dictionary
compression methods.
In
column Prefix compression, SQL Server identifies the single most common
set of bytes at the beginning (reading from left to right) of the column
values for all rows of a given column. It then stores that common set of
bytes for each column in a "Compression Information" record on the page.
The column values in the rows can then reference the single occurrence of
the common byte sequence rather than storing it repeatedly in each row
where it occurs. Rows that begin with all or even some of the common
bytes are compressed according to the number of bytes they share with
the common prefix. Rows that begin with none of the common
bytes are not compressed at all.
Since column Prefix compression is
done on a single column at a time, it can only compress data when
multiple values in a column start with the same bytes. For example, if a
table had a single column of first name values and only ten rows, and
if each first name started with a different character, Prefix compression
would have no benefit because there would be no common byte pattern at
the beginning of the column values. With a larger number of rows, and an
initial sequence of characters shared across a large number of rows,
Prefix compression could have a significant impact on the total
storage. To find repeating patterns of data across all rows and columns,
SQL Server performs Page Dictionary compression.
In Page Dictionary compression, SQL Server creates a dictionary (or list) of all common byte patterns across rows and
columns on the page. It is no longer looking for only common prefixes
on a column-by-column basis. Instead it is looking for all byte patterns
that are repeated in any column of any row. It stores this list of
common patterns in the same “Compression Information” structure in which
it keeps the column prefix values. It then replaces any occurrence of
those common byte patterns in the columns of the rows on the page with a
reference to the pattern's position in the dictionary.
The following example alters the table we made in the Row compression example and changes it to Page compression:
ALTER TABLE Student
REBUILD WITH (DATA_COMPRESSION=PAGE);
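Indexes are compressed with the same option through their own CREATE and ALTER statements. For example, the filtered index created earlier in this section could be rebuilt with Row compression (a sketch, assuming that index exists):
```sql
-- Rebuild the filtered index from the sparse-columns example
-- with Row compression enabled
ALTER INDEX FIDemographicsIncomes
ON Person.Demographics
REBUILD WITH (DATA_COMPRESSION = ROW);
```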
Individual index
and table partitions can have different compression settings. We can opt
on a partition-by-partition basis to have either no compression, Row
compression, or Page compression. If we have a table partitioned by
time, for example, we could leave the partition with the current month's
data uncompressed, use Row compression on the partition with the
previous month's data, and finally use Page compression on the
partitions with all older data. The following example shows the syntax
for compressing a table with multiple partitions:
CREATE TABLE TimePartitioned
(SomeKey int, SomeValue nvarchar(50), TheDate date)
ON TimeScheme(TheDate)
WITH
(
DATA_COMPRESSION = NONE ON PARTITIONS (1),
DATA_COMPRESSION = ROW ON PARTITIONS (2),
DATA_COMPRESSION = PAGE ON PARTITIONS (3)
);
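A single partition can later be rebuilt with a different setting, which is how data would be "aged" from one compression level to the next (a sketch against the table above):
```sql
-- Move partition 2 from Row to Page compression as its data ages
ALTER TABLE TimePartitioned
REBUILD PARTITION = 2
WITH (DATA_COMPRESSION = PAGE);
```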
Turning
compression on will have some implied costs in terms of both the
ongoing costs to maintain compression and the initial costs to compress
data already in the table or index. We should therefore first verify
that we will get a benefit from compression before we enable
compression.
Estimating Compression Savings
We can estimate how much
space compression should save us before we actually enable it. SQL
Server provides a system stored procedure named sp_estimate_data_compression_savings.
The procedure takes a sampling of the actual pages in the target table,
or index(es), performs compression on those pages, and measures the
space saved. It then estimates what the compression would be for all
pages based on that measurement. The following example uses the
procedure to estimate how much space savings we should expect if we were
to enable Page compression on all partitions (@partition_number=NULL) of the clustered index (@index_id=1) on the AdventureWorks2008.Production.Product table:
EXEC sp_estimate_data_compression_savings
@schema_name='Production',
@object_name='Product',
@index_id=1,
@partition_number=NULL,
@data_compression=PAGE;
This statement returns output similar to the following (edited) result:
object_name current_size(KB) compressed_size(KB)
----------- ---------------- -------------------
Product 120 72
Exercise . Estimating the Effect of Compression
Open a new query in SQL Server Management Studio and select AdventureWorks2008 from the list of available databases. Run the sample EXEC sp_estimate_data_compression_savings statement shown above. Try it on different tables. Try changing the @data_compression from PAGE to ROW.
The output shows that we
would go from 120KB to 72KB (60 percent of the original size) of storage
for the Product table. That sounds pretty good and gives us the
confidence to enable compression. If, however, the compression estimates
were bleaker, we would probably opt to leave compression off.