An Exchange infrastructure that uses Exchange Native Data Protection is one in which you don't need to perform traditional point-in-time backups. The move toward such an infrastructure is mainly cost-driven: you save the money you would otherwise spend on expensive backup solutions and on storage facilities for tapes or disks. To deploy an infrastructure without traditional point-in-time backups, you need to consider the following:
Create multiple database copies using DAGs so that each database is copied to different Exchange servers, and possibly to different datacenters. This reduces the risk of losing a database because of a malfunction such as a disk crash.
Note: As a best practice, you should create at least three database copies if you store these copies on just a bunch of disks (JBOD); at least two database copies are required if your disks have additional data protection such as RAID. Consider lagged database copies in addition to these.
Provide deleted item recovery by implementing Single Item Recovery and a hold policy so that you can recover changed or deleted items on user request. Traditionally, recovering items required restoring a brick-level backup or a full database, neither of which you need anymore.
To protect your databases from logical corruption, implement a lagged database copy that replays log files after a delay of up to 14 days.
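As a rough illustration of the Single Item Recovery item above, the following Exchange Management Shell sketch enables the feature on one mailbox and sets the deleted item retention window. The mailbox name jdoe and the 14-day value are assumptions made for this example only; choose values that match your own retention policy.

```powershell
# Hypothetical mailbox name; replace it with a real identity in your organization.
$mailbox = "jdoe"

# Enable Single Item Recovery and keep deleted items for 14 days (assumed value).
Set-Mailbox -Identity $mailbox -SingleItemRecoveryEnabled $true -RetainDeletedItemsFor 14.00:00:00

# Show the settings that are now in effect for the mailbox.
Get-Mailbox -Identity $mailbox | Format-List Name, SingleItemRecoveryEnabled, RetainDeletedItemsFor
```

To cover every mailbox, the same Set-Mailbox call could be fed from Get-Mailbox through the pipeline, but test on a single mailbox first.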
If you implement all these areas, you are well on your way to a backup-less infrastructure. However, you should not forget about Public Folders, which are not covered in this concept. Public Folder databases can be replicated to multiple servers, so you maintain multiple copies of them in your environment. But what happens if somebody deletes a Public Folder item or folder by mistake? As mentioned in the previous section, you can use deleted item retention to recover deleted items or folders. The big question is whether that solution is sufficient for your organization. If it is not, you might still need a third-party backup solution for your public folders.
Sascha Schmatz
Global Service Manager Messaging, Qimonda AG, Germany
We implemented a fully
backup-less solution with Exchange 2010. My company has 12,000
employees spread over 12 locations (North America, Europe, and Asia).
All Exchange 2010 servers are centralized at a single site located in
Germany. To prevent problems from going backup-less, we defined the
following corporate policies for the messaging service:
A user can recover deleted messages for 14 days; older deleted messages are no longer available. VIPs get 30 days, but this applies to fewer than 100 users.
Public Folders are used only for Free/Busy information and nothing else.
To realize a backup-less messaging system, we implemented DAGs with two database copies and one lagged database copy (lag time: 14 days) on RAID storage. The database copies are stored on different storage systems in the same location, which ensures that a disk failure does not affect all copies at the same time. Because we do not run any backups, we enabled circular logging for all databases. The lagged database copy protects us from logical database corruption. For these reasons we could give up doing backups without difficulty.
For single item restore, we configured Single Item Recovery to 14 days and enabled it for every mailbox. This increased user satisfaction; users previously had only four days of deleted items to recover. Using DAGs and Single Item Recovery enabled us to move away from the backup-to-disk solution and snapshots that we had to maintain for Exchange 2007. As you can guess, going backup-less saved us a lot of money without interfering with our data-protection policies.
1. Using Lagged Database Copies
A lagged database copy is a database copy that uses a delayed replay lag time before committing the log files to the database. By delaying the replay of logs into the database, you can recover it to a point in time in the past (maximum 14 days). Because 14 days is the fixed upper limit for a lagged copy, it does not fit every scenario, especially those where you need to restore items older than 14 days.
Lagged database copies can protect you from the extremely rare cases of logical corruption described in the following scenarios:
Database Logical Corruption
This is the case when the checksum of a database page matches, but the data on the page is logically wrong. It can occur when ESE attempts to write a database page and the operating system storage stack returns success even though the data either never makes it to disk or gets written to the wrong place. This behavior is called a lost flush. To guard against lost flushes, ESE includes a lost flush detection mechanism in the database, together with a single page restore feature.
Store Logical Corruption
This means data is added, deleted, or modified in a way that the user did not expect, so the user views it as corruption. Typically this is caused by a third-party application that issues a series of valid MAPI operations against the store. An example is a faulty archiving solution that changes all message items of the users. Single Item Recovery or retention hold provides some protection against this case because all changed items are kept and thus can be restored. However, especially when large amounts of data are changed, it might be easier to recover the database to a point in time before the corruption occurred.
Rogue Admin Protection
This is the case where the organization seeks protection against malicious or rogue administrators, particularly administrators who intentionally add, change, or remove data in a way that users see as undesirable. To protect against this, the lagged database copies can be placed on a server that is under separate administrative control.
If you use multiple database copies and Single Item Recovery, only the extremely rare case of catastrophic store logical corruption remains relevant. Depending on which third-party applications you use and your history with store logical corruption, lagged database copies may or may not be worthwhile for you.
In the following scenarios lagged database copies can be used to recover data:
Recovering a log file that was deleted on the source
Rolling back to a point in time because of a virus outbreak
Recovering a deleted item that is outside the retention time
1.1. Planning Lagged Database Copies
When planning for lagged
database copies, you should carefully consider the implications this
brings to your storage planning. Every lagged database needs sufficient
disk space for holding the database as well as the log files for the
configured time.
For example, at Microsoft, 14
days of logs for one database result in about 60,000 log files or 60 GB
of data. The log storage design for the lagged database copy needs to
accommodate this. In addition to the space requirements, consider the
following criteria when deciding the replay lag time:
How long does it take you to identify a logical database corruption? This should include non-working days such as weekends. If you configure a replay lag time of two days and the problem occurs on a weekend, the corrupt changes might already have been replayed into the lagged copy by the time you are back on Monday.
Consider
the maximum time where a replay lag time makes sense. Fourteen days is
the maximum time possible, but do you really need the full 14 days? In
most cases, 7 days should be sufficient to identify a corruption and be
able to recover using the lagged database copy.
Don't underestimate how the space requirements grow with the replay lag time. In the previous Microsoft example you needed to reserve 60 GB for 14 days; a lag of 7 days would therefore save you 30 GB of storage per database.
The
duration of replaying the log files is also worth considering. You
should plan a test to replay all log files; this might take a
considerable amount of time. Replaying 14 days of logs might require
several hours before the database is up to date.
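The storage and replay-time trade-off in the criteria above is simple arithmetic and can be sketched as follows. This is a minimal sketch, not a sizing tool: the function names are made up for this example, the figure of 60 GB per 14 days is the Microsoft example from the text, and the replay rate of 7.2 GB per hour is the rule of thumb quoted later in this text for 7.2K JBOD disks. Measure your own log generation and replay rates before relying on any of these numbers.

```powershell
# Estimate the log space a lagged copy must hold for a given replay lag.
function Get-LaggedCopyLogSpaceGB {
    param(
        [double]$LogGBPerDay,
        [int]$ReplayLagDays
    )
    # 14 days is the maximum replay lag time named in the text.
    if ($ReplayLagDays -lt 0 -or $ReplayLagDays -gt 14) {
        throw "ReplayLagDays must be between 0 and 14."
    }
    return [math]::Round($LogGBPerDay * $ReplayLagDays, 1)
}

# Estimate how long replaying that amount of log data takes.
function Get-LogReplayHours {
    param(
        [double]$LogGB,
        [double]$ReplayGBPerHour = 7.2   # rule-of-thumb rate, an assumption for your hardware
    )
    return [math]::Round($LogGB / $ReplayGBPerHour, 1)
}

# Daily log volume derived from the 60 GB per 14 days example.
$logGBPerDay = 60 / 14

foreach ($days in 7, 14) {
    $spaceGB = Get-LaggedCopyLogSpaceGB -LogGBPerDay $logGBPerDay -ReplayLagDays $days
    $hours = Get-LogReplayHours -LogGB $spaceGB
    "{0} days of lag: about {1} GB of logs, about {2} hours to replay" -f $days, $spaceGB, $hours
}
```

With these inputs, a 7-day lag needs roughly 30 GB of log space and a little over 4 hours of replay, while a 14-day lag needs roughly 60 GB and a little over 8 hours.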
Besides the replay lag time and the storage design, you should weigh the following points carefully:
How many lagged
database copies do you need? Normally one lagged copy should be
sufficient, but maybe you want more copies because of your
disaster-recovery requirements. If lagged database copies are a
critical piece of your disaster-recovery strategy, you will probably
want to put them on a RAID system or have multiple copies of them.
Where
should you store the lagged database copies—at a server at the same
site or offsite? This decision has a direct impact on the time you need
to recover the lagged database copy because you need to consider
available bandwidth when storing them offsite.
On what Exchange
server should you place the lagged database copies? You have the option
to place them on the same server where your active database copies are
stored, or you can use a single server just for all lagged database
copies, such as a dedicated public folder server.
Lagged
database copies always should be activation-disabled and have the
highest activation preference number available. This is required to
prevent automatic activation by mistake or resulting from a system
failure.
You should make the best
decision for your own situation. Don't start with the maximum of 14
days for replay lag time, but make a decision that suits your needs
considering both disaster recovery and budgetary (or storage design)
aspects.
Note: Lagged database copies cannot be patched with the single page restore feature. If a lagged database copy hits a page corruption, you have to reseed it to repair it (and thereby lose the lagged aspect of the copy). It is therefore best practice to either deploy the lagged database copy on RAID or create multiple lagged database copies when using JBOD.
1.2. Deploying Lagged Database Copies
You configure a lagged database copy using the EMS by following these steps:
Create a database copy to the target server where you want to store the lagged database copy.
Configure the ReplayLagTime of the database copy. The following cmdlet configures a lag time of 7 days for the copy of database DAG01-BERLIN-01 located on Berlin-MB01: Set-MailboxDatabaseCopy -Identity DAG01-BERLIN-01\Berlin-MB01 -ReplayLagTime 7.0:0:0.
Block
auto activation of this database to make sure it is not activated by
mistake. You use the following cmdlet to perform this task: Suspend-MailboxDatabaseCopy <database\server> -ActivationOnly -Confirm:$false.
If you use a dedicated Exchange server that hosts all lagged database copies, you can also block automatic activation at the server level by using the following cmdlet: Set-MailboxServer <mailbox server> -DatabaseCopyAutoActivationPolicy Blocked.
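Put together, the steps above might look like the following sketch. It reuses the database and server names from the example in the text. Two things are assumptions: Add-MailboxDatabaseCopy is used for the first step, which the text does not spell out, and the activation preference of 3 fits a database with three copies, so adjust it to the highest number available in your DAG.

```powershell
# Names taken from the example in the text.
$database = "DAG01-BERLIN-01"
$server   = "Berlin-MB01"
$copy     = "$database\$server"

# Step 1: create the database copy on the target server.
# ActivationPreference 3 is an assumed value for a database with three copies.
Add-MailboxDatabaseCopy -Identity $database -MailboxServer $server -ActivationPreference 3

# Step 2: delay log replay on this copy by 7 days.
Set-MailboxDatabaseCopy -Identity $copy -ReplayLagTime 7.00:00:00

# Step 3: block automatic activation of this copy only.
Suspend-MailboxDatabaseCopy -Identity $copy -ActivationOnly -Confirm:$false
```

Blocking activation per copy keeps other copies on the same server eligible for failover; the server-level policy is the simpler choice only when the server hosts nothing but lagged copies.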
When the lagged database
copy is configured, you will see that the replay queue length of the
lagged database will increase, as shown in Figure 1.
Note: To verify that no lagged database copy can be activated automatically, use the Get-MailboxDatabaseCopyStatus -Server <name> | ft Name, Act* cmdlet and make sure that the ActivationSuspended property is set to True.
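The check in the note can also be turned into a small filter that lists only the copies that are still eligible for automatic activation. This is a sketch that assumes Berlin-MB01 is a server dedicated to lagged copies, so every copy on it is expected to be activation-suspended.

```powershell
# Assumed: this server hosts only lagged database copies.
$server = "Berlin-MB01"

# Collect the copies that are NOT blocked from automatic activation.
$unprotected = @(Get-MailboxDatabaseCopyStatus -Server $server |
    Where-Object { -not $_.ActivationSuspended })

if ($unprotected.Count -eq 0) {
    "All database copies on $server are blocked from automatic activation."
} else {
    # Show the copies that still need Suspend-MailboxDatabaseCopy -ActivationOnly.
    $unprotected | Format-Table Name, Status, ReplayQueueLength, ActivationSuspended -AutoSize
}
```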
1.3. Using a Lagged Database Copy to Recover Data
Using a lagged database copy to get to a specific point in time is rather difficult because you have to know the exact time frame in which something occurred. In addition, no tools are available to tell you which log file contains which database change. Thus you have to estimate which log files need to be replayed to bring the database to the point in time that you require. In practice you make an educated guess, take the database and log files, and replay the logs manually before you can recover data from a recovery database.
Recovering a lagged database to a specific point in time is a manual process. Follow these steps to get to the data you're looking for:
Suspend replication to the lagged database copy by using the Suspend-MailboxDatabaseCopy <database>\<server> cmdlet.
Note: You
should now decide whether you want to back up or copy the database and
log files to a different location so that you have them available if
you don't get to the right point in time. You alternatively can create
a VSS snapshot using the VSSAdmin CREATE SHADOW /For=<Volume that includes database> command.
Use Explorer to delete or move all log files whose time stamp is newer than the point in time you want to go back to. For example, if you have 14 days of log files available and you want to go back 10 days, you only need to commit the log files that are 10 days old or older to the 14-day-old database. To achieve this, delete or move all log files that have a time stamp newer than 10 days, such as day 9 or newer.
Note the filename of the .chk file for the database and then delete the file. The name is normally something like E00.chk.
Run the Eseutil.exe /r E00 /a command, replacing E00 with the base name of the .chk file. Depending on the number of log files that need to be replayed, this might take several hours. A rule of thumb is that on normal 7.2K JBOD 3.5-inch disks, you can expect to replay approximately 7.2 GB of transaction log files per hour. The exact value, of course, depends on local factors such as storage performance or CPU.
Note: If you want to measure how long replaying the log files to the database takes, you can use the tool JetStress 2010, which includes a Recovery
Performance measure option for this exact situation. You can download
Microsoft Exchange Server Jetstress 2010 (64 bit) at http://go.microsoft.com/fwlink/?LinkId=178616.
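The log file selection in the steps above can be scripted instead of done by hand in Explorer. The following is a minimal sketch under stated assumptions: Move-LogFilesNewerThan is a helper name invented for this example, the folder paths in the commented usage are hypothetical, and the script moves files rather than deleting them so that you can try again if the chosen point in time turns out to be wrong. Run it only after replication to the lagged copy has been suspended.

```powershell
# Move every *.log file that is newer than the cutoff into a holding folder
# and return the number of files moved.
function Move-LogFilesNewerThan {
    param(
        [Parameter(Mandatory = $true)][string]$LogFolder,
        [Parameter(Mandatory = $true)][string]$HoldingFolder,
        [Parameter(Mandatory = $true)][datetime]$Cutoff
    )

    # Create the holding folder on first use.
    if (-not (Test-Path -Path $HoldingFolder)) {
        New-Item -ItemType Directory -Path $HoldingFolder | Out-Null
    }

    # Select log files by their last write time stamp.
    $newer = @(Get-ChildItem -Path $LogFolder -Filter "*.log" |
        Where-Object { $_.LastWriteTime -gt $Cutoff })

    foreach ($file in $newer) {
        Move-Item -Path $file.FullName -Destination $HoldingFolder
    }

    return $newer.Count
}

# Example usage with hypothetical paths, going back 10 days:
#   $logs = "D:\Lagged\DAG01-BERLIN-01\Logs"
#   Move-LogFilesNewerThan -LogFolder $logs -HoldingFolder "D:\Lagged\Hold" -Cutoff (Get-Date).AddDays(-10)
#   Remove-Item -Path (Join-Path $logs "E00.chk")   # after noting the file name
#   Set-Location $logs
#   Eseutil.exe /r E00 /a
```

Selecting by time stamp is only as precise as your estimate of when the problem occurred, which is the limitation the text describes.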
When Eseutil is finished, the database is in clean shutdown state. You can now decide how to continue:
You can create a recovery database using this database, mount it, and recover the data.
You can replace the corrupt database files with the lagged database files and mount the database.
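For the first option, a recovery database can be created on top of the recovered files roughly as follows. This is a sketch only: the recovery database name and the file paths are hypothetical, and it assumes that the recovered .edb file and its log files were copied to the paths shown.

```powershell
# Hypothetical names and paths; adjust them to where the recovered files are stored.
$server  = "Berlin-MB01"
$rdbName = "RDB-DAG01-BERLIN-01"
$edbPath = "E:\Recovery\DAG01-BERLIN-01\DAG01-BERLIN-01.edb"
$logPath = "E:\Recovery\DAG01-BERLIN-01"

# Create a recovery database that points at the recovered files.
New-MailboxDatabase -Recovery -Name $rdbName -Server $server -EdbFilePath $edbPath -LogFolderPath $logPath

# Mount it; this works only if the database is in a clean shutdown state.
Mount-Database -Identity $rdbName

# List the mailboxes that can be restored from the recovery database.
Get-MailboxStatistics -Database $rdbName | Format-Table DisplayName, ItemCount -AutoSize
```

From there, the data is typically restored with Restore-Mailbox or New-MailboxRestoreRequest, depending on the service pack level of your Exchange 2010 servers.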
As you can see, several
steps are involved here and the process is time-consuming because of
the large number of logs that must be replayed. The process is not
difficult, but is not something you want to be doing on a daily/weekly
basis because of the operational time required. Lagged copies were not
designed for the deleted item recovery case—they were designed for the
once-in-a-great-while scenario where multiple database copies within a DAG combined with retention hold is not enough protection in a backup-less environment.
Note: As
already mentioned, no tools are currently available from Microsoft that
allow you to automate the process of recovering a lagged database copy
to a specific point in time. However, third-party vendors may soon
provide solutions for this situation. Check the Internet regularly for
updates.
2. Backups and Log File Truncation
Log file truncation, that is, deleting the transaction log files that are no longer required for a successful database restore, takes place once you complete a successful backup. But if you have decided to stop using traditional point-in-time backups, how do you make sure the log files are removed so they don't pile up? The short answer is that, by default, they are never removed.
For this reason you need to configure log file truncation by enabling circular logging. You can enable circular logging on a database either in EMC or in EMS using the Set-MailboxDatabase -Identity <DatabaseName> -CircularLoggingEnabled $True cmdlet.
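As a small illustration, the following sketch enables circular logging on the example database used earlier in this text and then lists the setting for all mailbox databases, so that you can spot any database that was missed. The database name is carried over from the earlier example and is otherwise an assumption.

```powershell
# Database name reused from the earlier example.
$database = "DAG01-BERLIN-01"

# Enable circular logging on this database.
Set-MailboxDatabase -Identity $database -CircularLoggingEnabled $true

# Review the setting across all mailbox databases.
Get-MailboxDatabase | Format-Table Name, CircularLoggingEnabled -AutoSize
```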
Once you enable circular logging when multiple database copies are in place, you get a new type of circular logging called continuous replication circular logging (CRCL), which behaves differently from the traditional circular logging known from Exchange 2007 and earlier.
CRCL is performed by the Microsoft Exchange Replication Service, not the Microsoft Exchange Information Store service. Also, CRCL must take into account the log files that are still required for log shipping and replay before removing them. This needs special logic to ensure that all database copies process a log file before it is removed, which differs from the traditional circular logging logic where a log file was deleted as soon as it was committed to the database.
When CRCL is enabled, a log file is truncated for database copies that are not lagged once all of the following conditions are met:
The log file is below the checkpoint.
All other non-lagged database copies have replayed the log file into their database.
The log file has been inspected by all database copies (including any lagged database copies).
For lagged database copies, a log file is truncated once all of the following conditions are met:
The log file is below the checkpoint.
The log file is older than ReplayLagTime plus TruncationLagTime.
The log file is already deleted on the active database copy and all copies agree on the deletion.
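The two sets of truncation rules above can be modeled as a single decision function. This is purely an illustrative model of the rules as listed in this section, not the Replication Service's actual implementation: the function and parameter names are invented for this example, and it assumes that the lag condition means the log file must be older than the sum of ReplayLagTime and TruncationLagTime.

```powershell
# Decide whether a log file may be truncated on a given database copy.
function Test-LogTruncatable {
    param(
        [bool]$BelowCheckpoint,
        [bool]$IsLaggedCopy,
        [bool]$AllNonLaggedCopiesReplayed = $false,
        [bool]$InspectedByAllCopies = $false,
        [timespan]$LogAge = [timespan]::Zero,
        [timespan]$ReplayLagTime = [timespan]::Zero,
        [timespan]$TruncationLagTime = [timespan]::Zero,
        [bool]$DeletedOnActiveCopy = $false
    )

    # Both rule sets require the log file to be below the checkpoint.
    if (-not $BelowCheckpoint) { return $false }

    if ($IsLaggedCopy) {
        # Assumption: the two lag values are added together.
        $lagWindow = $ReplayLagTime + $TruncationLagTime
        return (($LogAge -gt $lagWindow) -and $DeletedOnActiveCopy)
    }

    # Non-lagged copy: every other copy must have replayed or inspected the file.
    return ($AllNonLaggedCopiesReplayed -and $InspectedByAllCopies)
}

$sevenDays = [timespan]::FromDays(7)

# A 3-day-old log file on a copy with a 7-day replay lag is kept: returns False.
Test-LogTruncatable -BelowCheckpoint $true -IsLaggedCopy $true -LogAge ([timespan]::FromDays(3)) -ReplayLagTime $sevenDays -DeletedOnActiveCopy $true

# An 8-day-old log file on the same copy may be removed: returns True.
Test-LogTruncatable -BelowCheckpoint $true -IsLaggedCopy $true -LogAge ([timespan]::FromDays(8)) -ReplayLagTime $sevenDays -DeletedOnActiveCopy $true
```

The model makes one consequence visible: a lagged copy keeps every log file for at least the full lag window, which is why its log volume has to be sized for that window.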
3. Reasons for Traditional Point-in-Time Backups
Even though Exchange Server 2010 supports backup-less scenarios, in some cases your organization may want to maintain its traditional backup methods. Keep in mind the following arguments when discussing the pros and cons of a backup-less infrastructure:
No Available DAGs
Organizations that do not use DAGs need to consider traditional ways to
back up their databases. A reason for not implementing DAGs is often
that they are too expensive to deploy—DAGs require a Windows Server
Enterprise Edition license.
Single Exchange Server Implementation
Single Exchange Server implementations are not conducive to DAG usage because a DAG requires adding more server hardware. Traditional backups to disks or tapes are the option to follow here.
Utilizing an Existing Backup Environment
If you have an existing backup environment in which all other applications back up their data, your company's backup strategy might force Exchange to follow suit. In that case, even when you maintain multiple copies of your database, you are required to keep a copy of it in your backup environment.
Compliance Requirements
You typically use tape backups if you have an archival reason to
preserve data for an extended time, as governed by compliance
requirements. You also need to ensure that you can access the data in
the future, especially if the storage is long term—sometimes up to 10
years.