The TABLESAMPLE clause
lets you query a random sample of data from a table (either an exact
number of rows or a percentage of rows). You can use TABLESAMPLE
to quickly return a sample from a large table when the sample does not
have to be a truly random sample at the level of individual rows. This
clause is also useful when you want to test your code against a random
subset of data that you copy from a production environment or when you
just want to test the validity of your solutions against a subset of
data as opposed to the entire data set.
To return a random sample of data using the TABLESAMPLE clause, you specify it in a query’s FROM clause, right after the table name or table alias. The TABLESAMPLE clause has the following syntax:
TABLESAMPLE [SYSTEM] (sample_number [ PERCENT | ROWS ] )
[ REPEATABLE (repeat_seed) ]
Specifying the SYSTEM keyword is optional, but this option is currently the only sampling method available in SQL Server and is applied by default. SYSTEM
specifies that an ANSI SQL implementation-dependent sampling method
will be used. This means that individual database management system
(DBMS) products can implement this method differently. In SQL Server,
the same sampling method that it uses to sample data to create
statistics is used to generate the results for TABLESAMPLE.
The result set returned by a query using TABLESAMPLE
and a specified percentage is created by generating a random value for
each physical page in the table. Based on the random value generated for
a page, that page is either included in the sample or excluded. When a
page is included in the sample, all rows on that page are returned in
the result set. For example, if you specify TABLESAMPLE SYSTEM 10 PERCENT, SQL Server returns all the rows from approximately 10% of the randomly selected data pages of the table.
When a specific number of rows
rather than a percentage is specified, the requested number of rows is
converted into a percentage of the total number of rows in the table and
a percentage of the number of pages that should be returned. The TABLESAMPLE operation is then performed against the computed percentage of pages.
If the rows are evenly distributed on the pages of the table, the number of rows returned by a TABLESAMPLE
query should be close to the requested sample size. However, if there
is a mix of full and sparse pages in the table, the number of rows
returned may vary widely for subsequent executions of the query.
Consider the following query:
with sales_sample as
( select * from sales TABLESAMPLE (1 percent) )
select count(*) as numrows from sales_sample
go
numrows
-----------
2055
There are 168,715 rows in the sales table in the bigpubs2008
database. A 1% sample should return approximately 1,687 rows. However,
as you can see from the preceding example, it returns 2,055 rows.
Note
Each time this query is run,
it is likely to return a different set of rows, so your row counts may
not match those presented in these examples.
If you invoke the query again, it could return a different number of rows:
with sales_sample as
( select * from sales TABLESAMPLE (1 percent) )
select count(*) as numrows from sales_sample
go
numrows
-----------
1138
Note also that if you
specify an actual number of rows, because the sampling is done at the
page level, the pages sampled may have more or fewer rows than required
to provide the requested sample size. For example, consider the
following query, which requests a TABLESAMPLE of 1,000 rows:
with sales_sample as
( select * from sales TABLESAMPLE (1000 rows) )
select count(*) as numrows from sales_sample
go
numrows
-----------
683
A subsequent execution of the same query could return a different number of rows:
with sales_sample as
( select * from sales TABLESAMPLE (1000 rows) )
select count(*) as numrows from sales_sample
go
numrows
-----------
1371
If you run this query
repeatedly, you are likely to get a different number of rows every time.
However, the larger the table and the greater the number of rows you
request, the more likely it is to get a closer percentage or number of
requested rows returned. The smaller the table and the smaller the
number or percentage of rows you request, the less likely the query is
to return the number of rows close to the number or percentage you
requested. With very small tables, you might not even get any rows.
To increase the
likelihood of receiving the number of rows that you request, you should
specify a greater number of rows than you actually need in the TABLESAMPLE clause and use the TOP
option to specify the number of rows you actually want. For example, if
you want a set of 1,000 random rows, you should request 2,000 rows in
the TABLESAMPLE clause and then limit it to 1,000 rows with the TOP option, as in this example:
select top 1000 * from sales TABLESAMPLE (2000 rows)
When you do this, you may still
get fewer than 1,000 rows returned, but the likelihood of that occurring
is lower than if you request 1,000 rows in the TABLESAMPLE clause alone. Also, by specifying TOP(1000), you’re guaranteed not to get more than 1,000 rows. When you use a combination of TABLESAMPLE and TOP, the data you obtain is a more representative sampling of the data in your table than if you use TOP alone.
If you want to generate the same random sample each time you use the TABLESAMPLE clause, you can specify the REPEATABLE option with a specified repeat_seed value. The REPEATABLE option causes a selected sample to be returned again. When REPEATABLE is specified with the same repeat_seed
value, SQL Server returns the same subset of rows, as long as no
changes have been made to the table. For example, the following query
uses repeat_seed of 1 and, in this case, returns 16,896 rows:
with sales_sample as
( select * from sales TABLESAMPLE (10 percent) repeatable (1) )
select count(*) as numrows from sales_sample
go
numrows
-----------
16896
When REPEATABLE is specified with a different repeat_seed value, SQL Server typically returns a different sample of the rows in the table. For example, the following query uses repeat_seed of 2 and gets a different set and number of rows:
with sales_sample as
( select * from sales TABLESAMPLE (10 percent) repeatable (2) )
select count(*) as numrows from sales_sample
go
numrows
-----------
19856
Running the query again with repeat_seed of 1 returns the same result rows as previously:
with sales_sample as
( select * from sales TABLESAMPLE (10 percent) repeatable (1) )
select count(*) as numrows from sales_sample
go
numrows
-----------
16896
The types of actions that are considered changes and could affect the repeatability of the TABLESAMPLE
results include inserts, updates, deletes, index rebuilding, index
defragmenting, restoration of a database, and attachment of a database.
You can use other
techniques to request random data samples, but most of those techniques
require scanning the entire table, which can be time-consuming and I/O
intensive for very large tables. Using TABLESAMPLE
for a specific table limits the Query Optimizer to performing table
scans only on that table, but physical I/Os are performed only on the
actual sampled pages included in the result set. Because of this, using TABLESAMPLE is usually a faster way of generating a random sampling of your data.