Despite the existence of DBCC options for
repairing corruption, certain repairs occur at the expense of data
loss; hence, the inclusion of ALLOW_DATA_LOSS as part of the
name of one of the repair options. As such, using this option should be
considered a last resort. In this section, let's look at an example in
which DBCC discovers corruption and investigate the scope of potential
data loss as well as the options available for recovery.
1. Interpreting DBCC output
Consider the following (abbreviated) output from a CHECKDB operation:
DBCC results for 'AdventureWorks'.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0
(type Unknown), page ID (1:676) contains an incorrect page ID in its
page header. The PageId in the page header = (0:0).
CHECKDB found 0 allocation errors and 1 consistency errors not associated
with any single object.
DBCC results for 'sys.sysrscols'.
DBCC results for 'Person.Address'.
There are 19614 rows in 280 pages for object "Person.Address".
DBCC results for 'testtable'.
Msg 8928, Level 16, State 1, Line 1
Object ID 87671360, index ID 0, partition ID 72057594058244096, alloc unit
ID 72057594062635008 (type In-row data): Page (1:676) could not be
processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 87671360, index ID 2, partition ID 72057594058309632, alloc unit
ID 72057594062700544 (type In-row data): Page (1:800) could not be
processed. See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 87671360, index ID 2, partition ID 72057594058309632,
alloc unit ID 72057594062700544 (type In-row data), page
(1:800). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are
12716041 and -4.
There are 2 rows in 2 pages for object "testtable".
CHECKDB found 0 allocation errors and 3 consistency errors in table
'testtable' (object ID 87671360).
CHECKDB found 0 allocation errors and 4 consistency errors in database
'AdventureWorks'.
repair_allow_data_loss is the minimum repair level for the errors found by
DBCC CHECKDB (AdventureWorks).
DBCC execution completed. If DBCC printed error messages, contact your
system administrator.
There are a few things to note here. Near the end of the output is the line repair_allow_data_loss is the minimum repair level..., which essentially means that corruption was found on a clustered index (or a heap).
Looking further up in the output, we can see error messages coming from Object ID 87671360, index ID 0, and index ID 2. A table without a clustered index is referred to as a heap.
Index ID 0 refers to a base data page from a heap table. Index ID 1
refers to a clustered index page, and index ID 2 and above refer to
pages from nonclustered indexes. When interpreting the DBCC output, seeing corruption only on index IDs 2 and above is somewhat good news; it means that the corruption is confined to nonclustered index pages.
In such cases, recovery is quite straightforward; we can simply rebuild
the index(es), or proceed with the REPAIR_REBUILD option of
DBCC, which will reinstate the missing/corrupted rows in the
nonclustered index (or rebuild it). Neither of these options will
result in any data loss.
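If it isn't clear which table or index a given error refers to, the object and index IDs in the output can be translated into names with a query like the one below (a simple sketch; run it in the affected database, substituting the object ID reported in the error messages).
-- Translate the object ID and index IDs from the DBCC output into names
-- 87671360 is the object ID reported in the example output above
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.index_id,
       i.name AS index_name,
       i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = 87671360;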
In our case, we have errors from index IDs less than 2. As a result, DBCC is suggesting that the repair_allow_data_loss option is the minimum repair level. This doesn't mean that we should
run with this option. As the name implies, it will result in data loss,
so we need to think through our options. We'll cover the recovery
options shortly. Before doing so, let's look at a way of inspecting the
extent of potential data loss.
2. Determining the extent of data loss with DBCC PAGE
One of the DBCC commands that we haven't spoken of yet is DBCC PAGE, an undocumented (and therefore unsupported) command. If a database page is accessible, DBCC PAGE
can be used to inspect its contents. In certain corruption scenarios,
this can be very useful in determining the extent of damage. Consider a
case where a range of clustered index (data) pages is corrupted, for example, pages 98 through 118. By inspecting the pages on either side of the corrupted range (97 and 119 in this case), we'll get a much better idea of the extent of the damage.
Figure 1 shows the output of the DBCC PAGE command. Before running the command, we turn on trace flag 3604 so that the output is displayed to the screen.
As figure 1 shows, DBCC PAGE
will return data from the page that we can use to determine the
contents, and thus the potential data loss—an important factor in
deciding on an appropriate recovery option.
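Because DBCC PAGE is undocumented, the syntax below reflects its commonly cited usage rather than any official reference: the parameters are the database name (or ID), file ID, page number, and a print option, with trace flag 3604 enabled first so the output is returned to the client. As a sketch, inspecting page 97 from the example above might look like this:
-- Send DBCC output to the client session rather than discarding it
DBCC TRACEON (3604);
GO
-- Commonly cited (undocumented) syntax:
-- DBCC PAGE (database name or ID, file ID, page number, print option)
-- Print option 3 returns the page header plus per-row details
DBCC PAGE ('AdventureWorks', 1, 97, 3);
GO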
3. Recovery options
To
recap, validating the I/O system with SQLIOSIM before production
implementation and ensuring page checksums are enabled are crucial
steps in avoiding a scenario in which a large amount of data is
corrupted before being discovered. SQLIOSIM will ensure the I/O system
is valid and reliable for SQL Server use, and using page checksums in
combination with regular event log monitoring helps you identify
corrupted pages early, hopefully before the corruption becomes
widespread.
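As a quick sketch, the current page verification setting can be checked, and checksums enabled where necessary, with statements such as these (CHECKSUM has been the default for new databases since SQL Server 2005):
-- Check the page verification setting for each database
SELECT name, page_verify_option_desc
FROM sys.databases;

-- Enable page checksums for a database that isn't using them
ALTER DATABASE AdventureWorks SET PAGE_VERIFY CHECKSUM;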
Of course, neither of these
best practices is a guarantee that we'll never have to deal with
corrupted data, so knowing the available recovery options is an
important step in preparing for the unexpected. Let's walk through the
major physical corruption recovery options, beginning with corrupted
nonclustered indexes.
Rebuilding nonclustered indexes
As we saw earlier, if the only corruption found was in nonclustered indexes (index ID 2 and above), then we can use the REPAIR_REBUILD
option, or we can simply re-create the index. If the corruption is on a
clustered index, we're not as lucky, and the restoration of a recent
backup becomes our best option. Fortunately, we're able to restore
individual pages.
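For the nonclustered index case, either approach might look like the following sketch (the index name is illustrative, and the DBCC repair requires the database to be in single-user mode):
-- Option 1: rebuild (or drop and re-create) the corrupted nonclustered index
-- IX_testtable_col1 is an illustrative index name
ALTER INDEX IX_testtable_col1 ON dbo.testtable REBUILD;

-- Option 2: let DBCC rebuild the nonclustered index
ALTER DATABASE AdventureWorks SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB ('AdventureWorks', REPAIR_REBUILD);
ALTER DATABASE AdventureWorks SET MULTI_USER;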
Page restore
The
importance of regular, validated backups can't be overstated,
particularly in corruption situations. In some cases, the only way out
of a corruption scenario is to restore a backup. The only thing worse
than discovering corruption is to then discover the backups are invalid
(or don't exist!). Performing regular backups with the WITH CHECKSUM
clause, together with regular monitoring for page checksum failures,
provides the best chance of detecting corruption early and having
reliable backups for recovery. Depending on the scale of corruption,
the availability of recent backups may enable the use of the page
restore technique.
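For example, a full backup that validates page checksums as each page is read might look like this (the path is illustrative):
-- Verify page checksums while writing the backup;
-- by default the backup fails if a damaged page is encountered
BACKUP DATABASE AdventureWorks
TO DISK = 'G:\SQL Backup\AdventureWorks.bak'
WITH CHECKSUM;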
Listing 1
shows an example script to restore two pages. Like a filegroup restore,
we follow the first restore with one or more transaction log restores,
which apply changes made to the pages since the full backup was taken.
After the first two restores, we take an additional transaction log
backup to ensure all changes to the pages are captured and restored.
Finally, we perform all restores with the exception of the last using
the WITH NORECOVERY option to enable subsequent restores to occur.
Listing 1. Page restore
-- Restore an individual page from a full backup file
-- Restore in NORECOVERY mode to allow subsequent t-log roll forwards
RESTORE DATABASE [AdventureWorks2008] PAGE = '1:676, 1:800'
FROM DISK = 'G:\SQL Backup\AdventureWorks.bak'
WITH NORECOVERY;

RESTORE LOG [AdventureWorks2008]
FROM DISK = 'G:\SQL Backup\AdventureWorks-Trn.bak'
WITH NORECOVERY;

BACKUP LOG [AdventureWorks2008]
TO DISK = 'G:\SQL Backup\AdventureWorks_20080718_0915_log.bak';

RESTORE LOG [AdventureWorks2008]
FROM DISK = 'G:\SQL Backup\AdventureWorks_20080718_0915_log.bak'
WITH RECOVERY;
GO
Note
that page restore is still possible in the non-Enterprise editions of
SQL Server, but the database can't be online during the restore
process. Further, page restores aren't possible for the transaction log
and certain pages of the database: the GAM and SGAM pages, page 0 (the
file boot page), and page 1:9 (the database boot page). Finally, as
with other online restores, an unbroken sequence of transaction log
backups is required.
Of course, if the
full backup used for the restore also contains corruption on the page,
then that's obviously of no help. Again, active event log monitoring
with page checksums in place is crucial in avoiding this situation by
identifying corruption as soon as possible.
If a valid backup isn't available for recovery, then as a last resort, the REPAIR_ALLOW_DATA_LOSS option can be used, after acknowledging that data will be lost as a result.
REPAIR_ALLOW_DATA_LOSS
Before running DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS option, it's worth either making a backup of the database or creating a database snapshot. If the repair doesn't yield the required results, the database can then be restored or reverted to the snapshot. Alternatively (or in addition), the repair can be performed in a user-defined transaction, with a ROLLBACK statement undoing the repair if appropriate.
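As a sketch (the snapshot name, logical file name, and paths are illustrative), the safety net and the repair itself might look like this:
-- Create a database snapshot as a fallback point before the repair
-- (NAME must match the logical name of the source database's data file)
CREATE DATABASE AdventureWorks_PreRepair ON
    (NAME = AdventureWorks_Data,
     FILENAME = 'G:\SQL Snapshot\AdventureWorks_PreRepair.ss')
AS SNAPSHOT OF AdventureWorks;
GO

-- The repair requires single-user mode
ALTER DATABASE AdventureWorks SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB ('AdventureWorks', REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE AdventureWorks SET MULTI_USER;
GO

-- If the result is unacceptable, revert to the snapshot
-- RESTORE DATABASE AdventureWorks
--     FROM DATABASE_SNAPSHOT = 'AdventureWorks_PreRepair';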
If the repair operation completes with the desired result, you should run DBCC CHECKCONSTRAINTS,
particularly if the repaired object was a table involved in foreign key
relationships or had other check constraints in place. Additional
business logic checking should be performed where possible to make sure
the effects of the repair won't cause unexpected problems at a later
point.
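As a sketch, the constraint check can target just the repaired table or every constraint in the database:
-- Check constraints on the repaired table only
DBCC CHECKCONSTRAINTS ('testtable');

-- Or check all enabled and disabled constraints in the database
DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;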
The REPAIR_ALLOW_DATA_LOSS
option isn't capable of repairing certain types of corruption (or it
doesn't make sense to try), notably the PFS (Page Free Space) page,
critical system tables, and corrupted column range values. In such
cases, backups will be required for recovery purposes.
After
any unexpected recovery situation that results from corruption, perform
a root cause analysis, discussed next, as soon as possible.
4. Root cause analysis
In most cases, corruption will be the result of a faulty I/O component. We've already covered the importance of using SQLIOSIM to validate the I/O system before a server is implemented in production. Passing validation doesn't preclude future problems; it just means that the I/O system was valid at that moment.
Following
a corruption event, it's absolutely crucial that you perform a thorough
analysis of the events leading up to the error. The usual suspects come
into play here, such as Windows event logs, SQL Server error logs, and
I/O software logs. If you suspect an I/O problem but can't pinpoint it,
consider rerunning SQLIOSIM. If any weakness exists in the I/O,
SQLIOSIM will more than likely find it.
A
thorough post-restore root-cause analysis is essential in limiting the
likelihood of further corruption events. Finally, if the required
backups weren't available as part of the recovery process, now would be
a good time to ensure this situation is addressed!