Despite the existence of DBCC options for
repairing corruption, certain repairs occur at the expense of data
loss; hence, the inclusion of ALLOW_DATA_LOSS as part of the
name of one of the repair options. As such, using this option should be
considered a last resort. In this section, let's look at an example in
which DBCC discovers corruption and investigate the scope of potential
data loss as well as the options available for recovery.
1. Interpreting DBCC output
Consider the following (abbreviated) output from a CHECKDB operation:
DBCC results for 'AdventureWorks'.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0
(type Unknown), page ID (1:676) contains an incorrect page ID in its
page header. The PageId in the page header = (0:0).
CHECKDB found 0 allocation errors and 1 consistency errors not associated
with any single object.
DBCC results for 'sys.sysrscols'.
DBCC results for 'Person.Address'.
There are 19614 rows in 280 pages for object "Person.Address".
DBCC results for 'testtable'.
Msg 8928, Level 16, State 1, Line 1
Object ID 87671360, index ID 0, partition ID 72057594058244096, alloc unit
ID 72057594062635008 (type In-row data): Page (1:676) could not be
processed. See other errors for details.
Msg 8928, Level 16, State 1, Line 1
Object ID 87671360, index ID 2, partition ID 72057594058309632, alloc unit
ID 72057594062700544 (type In-row data): Page (1:800) could not be
processed. See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 87671360, index ID 2, partition ID 72057594058309632,
alloc unit ID 72057594062700544 (type In-row data), page
(1:800). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are
12716041 and -4.
There are 2 rows in 2 pages for object "testtable".
CHECKDB found 0 allocation errors and 3 consistency errors in table
'testtable' (object ID 87671360).
CHECKDB found 0 allocation errors and 4 consistency errors in database
'AdventureWorks'.
repair_allow_data_loss is the minimum repair level for the errors found by
DBCC CHECKDB (AdventureWorks).
DBCC execution completed. If DBCC printed error messages, contact your
system administrator.
There are a few things to note here. Near the end of the output is the line repair_allow_data_loss is the minimum repair level..., which essentially means that corruption was found on a clustered index (or a heap).
Looking further up in the output, we can see error messages coming from Object ID 87671360, index ID 0, and index ID 2. A table without a clustered index is referred to as a heap.
Index ID 0 refers to a base data page from a heap table. Index ID 1
refers to a clustered index page, and index ID 2 and above refer to
pages from nonclustered indexes. When interpreting the DBCC output, seeing corruption only on index IDs 2 and above is somewhat good news; it means that the corruption is confined to nonclustered index pages.
In such cases, recovery is quite straightforward; we can simply rebuild
the index(es), or proceed with the REPAIR_REBUILD option of
DBCC, which will reinstate the missing/corrupted rows in the
nonclustered index (or rebuild it). Neither of these options will
result in any data loss.
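If it isn't clear which table or index a given error refers to, the object and index IDs in the output can be translated into names with a query like the one below (a simple sketch; run it in the affected database, substituting the object ID reported in the error messages).
-- Translate the object ID and index IDs from the DBCC output into names
-- 87671360 is the object ID reported in the example output above
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.index_id,
       i.name AS index_name,
       i.type_desc
FROM sys.indexes AS i
WHERE i.object_id = 87671360;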
In our case, we have errors from index IDs less than 2. As a result, DBCC is suggesting that the repair_allow_data_loss option is the minimum repair level. This doesn't mean that we should
run with this option. As the name implies, it will result in data loss,
so we need to think through our options. We'll cover the recovery
options shortly. Before doing so, let's look at a way of inspecting the
extent of potential data loss.
2. Determining the extent of data loss with DBCC PAGE
One of the DBCC commands that we haven't spoken of yet is DBCC PAGE, an undocumented (and therefore unsupported) command. If a database page is accessible, DBCC PAGE
can be used to inspect its contents. In certain corruption scenarios,
this can be very useful in determining the extent of damage. Consider a
case where a range of clustered index (data) pages is corrupted, for example, pages 98 through 118. By inspecting the pages on either side of the corrupted range (97 and 119 in this case), we'll get a much better idea of the extent of the damage.
Figure 1 shows the output of the DBCC PAGE command. Before running the command, we turn on trace flag 3604 so that the output is displayed to the screen.
As figure 1 shows, DBCC PAGE
will return data from the page that we can use to determine the
contents, and thus the potential data loss—an important factor in
deciding on an appropriate recovery option.
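Because DBCC PAGE is undocumented, the syntax below reflects its commonly cited usage rather than any official reference: the parameters are the database name (or ID), file ID, page number, and a print option, with trace flag 3604 enabled first so the output is returned to the client. As a sketch, inspecting page 97 from the example above might look like this:
-- Send DBCC output to the client session rather than discarding it
DBCC TRACEON (3604);
GO
-- Commonly cited (undocumented) syntax:
-- DBCC PAGE (database name or ID, file ID, page number, print option)
-- Print option 3 returns the page header plus per-row details
DBCC PAGE ('AdventureWorks', 1, 97, 3);
GO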
3. Recovery options
To
recap, validating the I/O system with SQLIOSIM before production
implementation and ensuring page checksums are enabled are crucial
steps in avoiding a scenario in which a large amount of data is
corrupted before being discovered. SQLIOSIM will ensure the I/O system
is valid and reliable for SQL Server use, and using page checksums in
combination with regular event log monitoring helps you identify
corrupted pages early, hopefully before the corruption becomes
widespread.
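As a quick sketch, the current page verification setting can be checked, and checksums enabled where necessary, with statements such as these (CHECKSUM has been the default for new databases since SQL Server 2005):
-- Check the page verification setting for each database
SELECT name, page_verify_option_desc
FROM sys.databases;

-- Enable page checksums for a database that isn't using them
ALTER DATABASE AdventureWorks SET PAGE_VERIFY CHECKSUM;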
Of course, neither of these
best practices is a guarantee that we'll never have to deal with
corrupted data, so knowing the available recovery options is an
important step in preparing for the unexpected. Let's walk through the
major physical corruption recovery options, beginning with corrupted
nonclustered indexes.
Rebuilding nonclustered indexes
As we saw earlier, if the only corruption found was in nonclustered indexes (index ID 2 and above), then we can use the REPAIR_REBUILD
option, or we can simply re-create the index. If the corruption is on a
clustered index, we're not as lucky, and the restoration of a recent
backup becomes our best option. Fortunately, we're able to restore
individual pages.
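For the nonclustered index case, either approach might look like the following sketch (the index name is illustrative, and the DBCC repair requires the database to be in single-user mode):
-- Option 1: rebuild (or drop and re-create) the corrupted nonclustered index
-- IX_testtable_col1 is an illustrative index name
ALTER INDEX IX_testtable_col1 ON dbo.testtable REBUILD;

-- Option 2: let DBCC rebuild the nonclustered index
ALTER DATABASE AdventureWorks SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB ('AdventureWorks', REPAIR_REBUILD);
ALTER DATABASE AdventureWorks SET MULTI_USER;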
Page restore
The
importance of regular, validated backups can't be overstated,
particularly in corruption situations. In some cases, the only way out
of a corruption scenario is to restore a backup. The only thing worse
than discovering corruption is to then discover the backups are invalid
(or don't exist!). Performing regular backups with the WITH CHECKSUM
clause, together with regular monitoring for page checksum failures,
provides the best chance of detecting corruption early and having
reliable backups for recovery. Depending on the scale of corruption,
the availability of recent backups may enable the use of the page
restore technique.
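For example, a full backup that validates page checksums as each page is read might look like this (the path is illustrative):
-- Verify page checksums while writing the backup;
-- by default the backup fails if a damaged page is encountered
BACKUP DATABASE AdventureWorks
TO DISK = 'G:\SQL Backup\AdventureWorks.bak'
WITH CHECKSUM;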
Listing 1
shows an example script to restore two pages. Like a filegroup restore,
we follow the first restore with one or more transaction log restores,
which apply changes made to the pages since the full backup was taken.
After the first two restores, we take an additional transaction log
backup to ensure all changes to the pages are captured and restored.
Finally, we perform all restores with the exception of the last using
the WITH NORECOVERY option to enable subsequent restores to occur.
Listing 1. Page restore
-- Restore an individual page from a full backup file
-- Restore in NORECOVERY mode to allow subsequent t-log roll forwards
RESTORE DATABASE [AdventureWorks2008] PAGE = '1:676, 1:800'
FROM DISK = 'G:\SQL Backup\AdventureWorks.bak'
WITH NORECOVERY;

RESTORE LOG [AdventureWorks2008]
FROM DISK = 'G:\SQL Backup\AdventureWorks-Trn.bak'
WITH NORECOVERY;

BACKUP LOG [AdventureWorks2008]
TO DISK = 'G:\SQL Backup\AdventureWorks_20080718_0915_log.bak';

RESTORE LOG [AdventureWorks2008]
FROM DISK = 'G:\SQL Backup\AdventureWorks_20080718_0915_log.bak'
WITH RECOVERY;
GO
Note
that page restore is still possible in the non-Enterprise editions of
SQL Server, but the database can't be online during the restore
process. Further, page restores aren't possible for the transaction log
and certain pages of the database: the GAM and SGAM pages, page 0 (the
file boot page), and page 1:9 (the database boot page). Finally, as
with other online restores, an unbroken sequence of transaction log
backups is required.
Of course, if the
full backup used for the restore also contains corruption on the page,
then that's obviously of no help. Again, active event log monitoring
with page checksums in place is crucial in avoiding this situation by
identifying corruption as soon as possible.
If a valid backup isn't available for recovery, then as a last resort, the REPAIR_ALLOW_DATA_LOSS option can be used, after acknowledging that data will be lost as a result.
REPAIR_ALLOW_DATA_LOSS
Before running DBCC CHECKDB with the REPAIR_ALLOW_DATA_LOSS option, it's worth either making a backup of the database or creating a database snapshot. If the repair doesn't yield the required results, the database can then be restored or reverted to the snapshot. Alternatively (or in addition), the repair can be performed in a user-defined transaction, with a ROLLBACK statement undoing the repair if appropriate.
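As a sketch (the snapshot name, logical file name, and paths are illustrative), the safety net and the repair itself might look like this:
-- Create a database snapshot as a fallback point before the repair
-- (NAME must match the logical name of the source database's data file)
CREATE DATABASE AdventureWorks_PreRepair ON
    (NAME = AdventureWorks_Data,
     FILENAME = 'G:\SQL Snapshot\AdventureWorks_PreRepair.ss')
AS SNAPSHOT OF AdventureWorks;
GO

-- The repair requires single-user mode
ALTER DATABASE AdventureWorks SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB ('AdventureWorks', REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE AdventureWorks SET MULTI_USER;
GO

-- If the result is unacceptable, revert to the snapshot
-- RESTORE DATABASE AdventureWorks
--     FROM DATABASE_SNAPSHOT = 'AdventureWorks_PreRepair';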
If the repair operation completes with the desired result, you should run DBCC CHECKCONSTRAINTS,
particularly if the repaired object was a table involved in foreign key
relationships or had other check constraints in place. Additional
business logic checking should be performed where possible to make sure
the effects of the repair won't cause unexpected problems at a later
point.
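As a sketch, the constraint check can target just the repaired table or every constraint in the database:
-- Check constraints on the repaired table only
DBCC CHECKCONSTRAINTS ('testtable');

-- Or check all enabled and disabled constraints in the database
DBCC CHECKCONSTRAINTS WITH ALL_CONSTRAINTS;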
The REPAIR_ALLOW_DATA_LOSS
option isn't capable of repairing certain types of corruption (or it
doesn't make sense to try), notably the PFS (Page Free Space) page,
critical system tables, and corrupted column range values. In such
cases, backups will be required for recovery purposes.
After
any unexpected recovery situation that results from corruption, perform
a root cause analysis, discussed next, as soon as possible.
4. Root cause analysis
In most cases, corruption will be the result of a faulty I/O component. We've already covered the importance of using SQLIOSIM to validate the I/O system before a server is implemented in production. Passing validation doesn't preclude future problems; it just means that the I/O system was valid at that moment.
Following
a corruption event, it's absolutely crucial that you perform a thorough
analysis of the events leading up to the error. The usual suspects come
into play here, such as Windows event logs, SQL Server error logs, and
I/O software logs. If you suspect an I/O problem but can't pinpoint it,
consider rerunning SQLIOSIM. If any weakness exists in the I/O,
SQLIOSIM will more than likely find it.
A
thorough post-restore root-cause analysis is essential in limiting the
likelihood of further corruption events. Finally, if the required
backups weren't available as part of the recovery process, now would be
a good time to ensure this situation is addressed!