SQL Server 2008 : Transact-SQL Programming - The TABLESAMPLE Clause

6/16/2011 3:20:12 PM

The TABLESAMPLE clause lets you query a random sample of data from a table (either an exact number of rows or a percentage of rows). You can use TABLESAMPLE to quickly return a sample from a large table when the sample does not have to be a truly random sample at the level of individual rows. This clause is also useful when you want to test your code against a random subset of data that you copy from a production environment or when you just want to test the validity of your solutions against a subset of data as opposed to the entire data set.

To return a random sample of data using the TABLESAMPLE clause, you specify it in a query’s FROM clause, right after the table name or table alias. The TABLESAMPLE clause has the following syntax:

TABLESAMPLE [SYSTEM] (sample_number [ PERCENT | ROWS ] )
[ REPEATABLE (repeat_seed) ]

Specifying the SYSTEM keyword is optional, but this option is currently the only sampling method available in SQL Server and is applied by default. SYSTEM specifies that an ANSI SQL implementation-dependent sampling method will be used. This means that individual database management system (DBMS) products can implement this method differently. In SQL Server, the same sampling method that it uses to sample data to create statistics is used to generate the results for TABLESAMPLE.

The result set returned by a query using TABLESAMPLE and a specified percentage is created by generating a random value for each physical page in the table. Based on the random value generated for a page, that page is either included in the sample or excluded. When a page is included in the sample, all rows on that page are returned in the result set. For example, if you specify TABLESAMPLE SYSTEM 10 PERCENT, SQL Server returns all the rows from approximately 10% of the randomly selected data pages of the table.

When a specific number of rows rather than a percentage is specified, the requested number of rows is converted into a percentage of the total number of rows in the table and a percentage of the number of pages that should be returned. The TABLESAMPLE operation is then performed against the computed percentage of pages.

If the rows are evenly distributed on the pages of the table, the number of rows returned by a TABLESAMPLE query should be close to the requested sample size. However, if there is a mix of full and sparse pages in the table, the number of rows returned may vary widely for subsequent executions of the query. Consider the following query:

with sales_sample as
( select * from sales TABLESAMPLE (1 percent) )
select count(*) as numrows from sales_sample
go

numrows
-----------
2055

There are 168,715 rows in the sales table in the bigpubs2008 database. A 1% sample should return approximately 1,687 rows. However, as you can see from the preceding example, it returns 2,055 rows.

Note

Each time this query is run, it is likely to return a different set of rows, so your row counts may not match those presented in these examples.

If you invoke the query again, it could return a different number of rows:

with sales_sample as
( select * from sales TABLESAMPLE (1 percent) )
select count(*) as numrows from sales_sample
go

numrows
-----------
1138

Note also that if you specify an actual number of rows, because the sampling is done at the page level, the pages sampled may have more or fewer rows than required to provide the requested sample size. For example, consider the following query, which requests a TABLESAMPLE of 1,000 rows:

with sales_sample as
( select * from sales TABLESAMPLE (1000 rows) )
select count(*) as numrows from sales_sample
go

numrows
-----------
683

A subsequent execution of the same query could return a different number of rows:

with sales_sample as
( select * from sales TABLESAMPLE (1000 rows) )
select count(*) as numrows from sales_sample
go

numrows
-----------
1371

If you run this query repeatedly, you are likely to get a different number of rows every time. However, the larger the table and the greater the number of rows you request, the more likely it is to get a closer percentage or number of requested rows returned. The smaller the table and the smaller the number or percentage of rows you request, the less likely the query is to return the number of rows close to the number or percentage you requested. With very small tables, you might not even get any rows.

To increase the likelihood of receiving the number of rows that you request, you should specify a greater number of rows than you actually need in the TABLESAMPLE clause and use the TOP option to specify the number of rows you actually want. For example, if you want a set of 1,000 random rows, you should request 2,000 rows in the TABLESAMPLE clause and then limit it to 1,000 rows with the TOP option, as in this example:

select top 1000 * from sales TABLESAMPLE (2000 rows)

When you do this, you may still get fewer than 1,000 rows returned, but the likelihood of that occurring is lower than if you request 1,000 rows in the TABLESAMPLE clause alone. Also, by specifying TOP(1000), you’re guaranteed not to get more than 1,000 rows. When you use a combination of TABLESAMPLE and TOP, the data you obtain is a more representative sampling of the data in your table than if you use TOP alone.

If you want to generate the same random sample each time you use the TABLESAMPLE clause, you can specify the REPEATABLE option with a specified repeat_seed value. The REPEATABLE option causes a selected sample to be returned again. When REPEATABLE is specified with the same repeat_seed value, SQL Server returns the same subset of rows, as long as no changes have been made to the table. For example, the following query uses repeat_seed of 1 and, in this case, returns 16,896 rows:

with sales_sample as
( select * from sales TABLESAMPLE (10 percent) repeatable (1) )
select count(*) as numrows from sales_sample
go

numrows
-----------
16896

When REPEATABLE is specified with a different repeat_seed value, SQL Server typically returns a different sample of the rows in the table. For example, the following query uses repeat_seed of 2 and gets a different set and number of rows:

with sales_sample as
( select * from sales TABLESAMPLE (10 percent) repeatable (2) )
select count(*) as numrows from sales_sample
go

numrows
-----------
19856

Running the query again with repeat_seed of 1 returns the same result rows as previously:

with sales_sample as
( select * from sales TABLESAMPLE (10 percent) repeatable (1) )
select count(*) as numrows from sales_sample
go

numrows
-----------
16896

The types of actions that are considered changes and could affect the repeatability of the TABLESAMPLE results include inserts, updates, deletes, index rebuilding, index defragmenting, restoration of a database, and attachment of a database.

You can use other techniques to request random data samples, but most of those techniques require scanning the entire table, which can be time-consuming and I/O intensive for very large tables. Using TABLESAMPLE for a specific table limits the Query Optimizer to performing table scans only on that table, but physical I/Os are performed only on the actual sampled pages included in the result set. Because of this, using TABLESAMPLE is usually a faster way of generating a random sampling of your data.

Other

SQL Server 2008 : Transact-SQL Programming - TRY...CATCH Logic for Error Handling

SQL Server 2008 : Transact-SQL Programming - The APPLY Operator

SQL Server 2008 : Transact-SQL Programming - PIVOT and UNPIVOT

SQL Server 2008 : Transact-SQL Programming - Ranking Functions

SQL Server 2008 : Transact-SQL Programming - Common Table Expressions

SQL Server 2008 : Transact-SQL Programming - The OUTPUT Clause

SQL Server 2008 : Transact-SQL Programming - TOP Enhancements

SQL Server 2008 : Transact-SQL Programming - The max Specifier

SQL Server 2008 : T-SQL Tips and Tricks (part 3) - Generating T-SQL Statements with T-SQL & De-Duping Data with Ranking Functions

SQL Server 2008 : T-SQL Tips and Tricks (part 2) - Using CONTEXT_INFO & Working with Outer Joins