Active Manager runs as part of the Microsoft Exchange
Replication Service (MSExchangeRepl, not to be confused with the
Mailbox Replication Service) on every server within a DAG. Conceptually,
Active Manager is the orchestrator for native-mode high availability in
Exchange because it decides which database copies are active and which
are passive, taking into account administrator preferences such as the
database activation preference order. You can regard Active Manager as
the successor of the resource management model used by previous
iterations of Exchange clustering technology.
Active Manager can operate in two roles. One server in the DAG takes on the Primary Active Manager (PAM) role and the others operate in a Standby Active Manager (SAM)
role. Whether in PAM or SAM mode, servers continually monitor databases
at both the Information Store and Extensible Storage Engine (ESE)
levels to be able to detect failures. However, it is the PAM that
determines which database copies are active and which are
passive; the SAM concentrates on monitoring
the health of the Information Store process and the databases that run
on the local server. Once it detects a failure of the Information Store
or a database, the SAM on that server asks the PAM to initiate a
failover, provided that the databases affected by the failure are
replicated and a copy exists that the failover can make active. A
failure might be caused by a storage failure that takes one or more
disks offline or a problem that causes the ESE database engine to
consider that a disk is unresponsive. If the server hosting the PAM is
still online, it initiates the failover. If this server has been taken
offline by a failure, another server in the DAG seizes the PAM role and
begins to bring any necessary database copies online to restore service.
The server that holds
the PAM role is responsible for processing topology changes that occur
within the DAG and making decisions about
how to react to server failures, such as deciding to perform an
automatic transition of a passive copy of a database to become active
because the server that currently hosts the active copy is unavailable
for some reason. The PAM server owns the cluster quorum resource for the
default cluster group that underpins the DAG. If the server that owns
the cluster quorum resource fails, the PAM role moves to the server that
takes ownership of the cluster quorum resource. We will discuss the
criteria used by the PAM to select the database copy to activate
shortly. Once a new database copy has been successfully mounted, the PAM
updates the RPC Client Access service with details of the server that
hosts the newly activated copy so that client connections can be
directed to the correct server.
The active copy of a database is referred to as the mailbox database master. The active role can move from one copy of a database to another through a switchover,
which is a change initiated by an administrator, perhaps in preparation
to take a server offline to apply a service pack or other software
upgrade, or a failover,
which is the result of a hardware or software outage that prevents a
database from functioning properly. In either case, Active Manager is
responsible for selecting and enabling a new copy to accept incoming
client traffic and is the definitive source of information about
what server has mounted the currently active copy of a database.
Configuration information is written back into the cluster database and
updated there by Active
Manager, but transient information relating to current database status
is held in memory. You can view information about a DAG, including
the names of the servers in the DAG and the current
Primary Active Manager, with the Get-DatabaseAvailabilityGroup cmdlet and its -Status parameter.
The following output is an
edited version of the properties of a DAG returned by the
Get-DatabaseAvailabilityGroup cmdlet. You can see that the DAG is
composed of two servers (ExServer1 and ExServer2), that ExServer1 is
currently serving as the PAM, and that a server called adroot.contoso.com hosts the file share witness (FSW).
Get-DatabaseAvailabilityGroup -Identity 'DAG-Dublin' -Status | Format-List
Name : DAG-Dublin
Servers : {EXSERVER2, EXSERVER1}
WitnessServer : adroot.contoso.com
WitnessDirectory : C:\DAG-Dublin\FSW
AlternateWitnessServer :
AlternateWitnessDirectory :
NetworkCompression : InterSubnetOnly
NetworkEncryption : InterSubnetOnly
DatacenterActivationMode : Off
StoppedMailboxServers : {}
StartedMailboxServers : {}
DatabaseAvailabilityGroupIpv4Addresses : {}
DatabaseAvailabilityGroupIpAddresses : {}
AllowCrossSiteRpcClientAccess : False
OperationalServers : {EXSERVER1, EXSERVER2}
PrimaryActiveManager : EXSERVER1
ServersInMaintenance : {}
ThirdPartyReplication : Disabled
ReplicationPort : 64327
NetworkNames : {DAGNetwork01}
WitnessShareInUse : Primary
AdminDisplayName :
ExchangeVersion : 0.10 (14.0.100.0)
DistinguishedName : CN=DAG-Dublin,CN=Database Availability
Groups,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative
Groups,CN=contoso,CN=Microsoft Exchange,CN=Services,CN=Configuration,
DC=contoso,DC=com
Identity : DAG-Dublin
ObjectCategory : contoso.com/Configuration/Schema/
ms-Exch-MDB-Availability-
ObjectClass : {top, msExchMDBAvailabilityGroup}
OriginatingServer : ADRoot.contoso.com
If the PAM
fails, its functions automatically pass to another server in the DAG
that takes over ownership of the cluster quorum resource. You can also
move the cluster quorum resource to another server to transfer the PAM
role if you want to take the server that holds the role offline for some
reason, such as to apply a software upgrade or patch.
One interesting aspect of Active Manager is that its internal workings
and the way it performs database transitions remain
confidential within Microsoft. Third-party clustering or
high-availability solutions that work with Exchange can upgrade their
products to accept a direction from Active Manager to perform a
transition, but they cannot integrate at a lower level.
1. Automatic database transitions
Figure 1
illustrates an example of a DAG containing three servers, each hosting
two databases. Each of the databases is replicated to another server to
provide a basic level of robustness to a server outage. If server 1,
which hosts the active copies of databases 1 and 2, fails, the Active
Manager process will reroute user connections to pick up the copies of
the databases on servers 2 and 3. Users connected to database 1 will be
redirected to server 2 and those connected to database 2 will go to
server 3. Similarly, if the disk holding database 2 on server 1 fails,
Active Manager will detect the problem and reroute traffic to server 3.
In this scenario, each database
has just one passive copy and you might decide that the probability that more
than one server will ever fail at one time is negligible, so it is
sufficient to rely on the single additional copy. However, if the DAG
extended across more than one datacenter you would probably configure
every database to be replicated to all servers. In this scenario, copies
of databases 1 and 2 would be present on server 3, meaning that if
servers 1 and 2 were both unavailable, users could still get to their
data using the copies hosted on server 3.
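The rerouting described for Figure 1 can be modeled with a few lines of Python. This is only an illustrative sketch, not how Active Manager is implemented: the dictionary layout and the fail_server function are hypothetical, only databases 1 and 2 are modeled, and the "first surviving copy wins" rule stands in for the real selection process.

```python
# Hypothetical model of the Figure 1 layout, limited to databases 1 and 2.
layout = {
    "DB1": {"active": "Server1", "passive": ["Server2"]},
    "DB2": {"active": "Server1", "passive": ["Server3"]},
}


def fail_server(layout, failed):
    """Return the layout after `failed` goes offline.

    Active copies on the failed server move to the first surviving passive
    copy; a database with no surviving copy ends up with active=None.
    """
    result = {}
    for name, db in layout.items():
        survivors = [s for s in db["passive"] if s != failed]
        if db["active"] is not None and db["active"] != failed:
            result[name] = {"active": db["active"], "passive": survivors}
        elif survivors:
            result[name] = {"active": survivors[0], "passive": survivors[1:]}
        else:
            result[name] = {"active": None, "passive": []}
    return result


after_one = fail_server(layout, "Server1")
print(after_one["DB1"]["active"], after_one["DB2"]["active"])  # Server2 Server3

# A second outage leaves DB1 with no copy to activate.
after_two = fail_server(after_one, "Server2")
print(after_two["DB1"]["active"], after_two["DB2"]["active"])  # None Server3
```

The second call shows why a single passive copy protects against only one outage: once server 2 also fails, database 1 has nowhere left to go.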
The number of copies you can
create for an individual database is limited only by the number of
available servers in the DAG, disk space, and available bandwidth. The
high-capacity bandwidth available within a datacenter means that the
availability of sufficient disk space to hold replicated databases and
transaction logs is likely to be more of an issue. This issue is
somewhat mitigated by the ability to deploy databases on low-cost drives,
provided there is sufficient rack space, power, and cooling within the
datacenter to support the disks. One sample environment has 15 servers
in a DAG. There are 110 active databases and each database has two
passive copies, so there’s a total of 330 database copies in the environment.
Each server supports 22 database copies (a mixture of active and passive) and
dedicates 18 TB of storage for mailbox databases. Having three copies of
each database is a reasonable approach to ensuring high resilience
against a wide range of failures, including an interesting design point
of planning database copies so that a failure that affects a rack cannot
prevent service to a database. In other words, you should not put all
the servers that host a database and all of its copies in the same rack.
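The figures for the sample environment can be checked with some simple arithmetic. The sketch below is illustrative: the average storage per copy is derived here rather than stated in the text (and assumes the 18 TB is spread evenly), and the rack names, server names, and spans_multiple_racks helper are hypothetical.

```python
servers = 15
active_databases = 110
passive_copies_per_database = 2
storage_per_server_tb = 18

copies_per_database = 1 + passive_copies_per_database      # 3
total_copies = active_databases * copies_per_database      # 330
copies_per_server = total_copies // servers                # 22
avg_tb_per_copy = storage_per_server_tb / copies_per_server

print(total_copies, copies_per_server, round(avg_tb_per_copy, 2))  # 330 22 0.82


def spans_multiple_racks(copy_servers, rack_of):
    """True if the servers holding one database's copies sit in 2+ racks."""
    return len({rack_of[s] for s in copy_servers}) > 1


# Hypothetical rack assignment, for illustration only.
rack_of = {"MBX01": "RackA", "MBX02": "RackA", "MBX03": "RackB"}
print(spans_multiple_racks(["MBX01", "MBX02"], rack_of))           # False
print(spans_multiple_racks(["MBX01", "MBX02", "MBX03"], rack_of))  # True
```

A check like spans_multiple_racks could be run against a planned copy layout to flag any database whose copies would all be lost to a single rack failure.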
The Microsoft Exchange
Replication Service monitors database health on each DAG member to
ensure that active databases are properly mounted and available and that
ESE has signaled no I/O or corruption errors on a server. If an error
is detected, the Replication Service notifies Active Manager, which then
begins the process of selecting the best possible available copy and
then making that copy of the database active to take the place of the
failed database.
Before discussing the details of how automatic
database transitions occur, it’s worth making the point that Exchange
does not invoke a transition when an administrator dismounts an active
database. This action is deemed to be one that the administrator has
taken deliberately, possibly to allow maintenance to proceed or even for
something more prosaic such as enabling circular logging for a
database. Exchange therefore has no reason to consider a failover,
because failovers are designed to occur automatically as a result of a change
in the operating environment that is not caused by an administrator.
2. Best copy selection
To make the choice to replace a failed copy of a replicated database, Active Manager runs a process called best copy selection (BCS).
The aim of BCS is to take all possible complications and administrative
blocks into account when selecting the right database copy to transfer
service to, as it’s obviously not a good thing to attempt to transfer
service to a database that is failing in its own right. BCS works by
creating a sorted list of available database copies after ignoring any
database copies that are:
On servers that are currently unreachable for some reason (network failure, maintenance, or other conditions).
Administratively blocked from activation because the DatabaseCopyAutoActivationPolicy property of the mailbox server that hosts the copy has been set to Blocked using the Set-MailboxServer cmdlet.
Ideally, the list that remains
after these exclusions contains at least one copy. If only one database copy
is available, Active Manager runs the attempt copy last logs (ACLL)
process to bring that database copy up to date. If more than one
database copy is available, Active Manager proceeds with the BCS process
by sorting the list according to the copy queue length, with the copy
with the fewest outstanding logs to copy at the top of the list. The
actual sort is performed using the value of the LastLogInspected
property. This property records the generation number of the last
transaction log file that was inspected for the database copy. The end result
is that the database copy that needs the least work to update after
activation is at the top of the list.
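The exclusion and sorting steps described above can be sketched as follows. This is a simplified model rather than Exchange code: the DatabaseCopy class, its field names, and the sample servers are hypothetical, and the sort uses an explicit copy queue length rather than the LastLogInspected value that the real process relies on.

```python
from dataclasses import dataclass


@dataclass
class DatabaseCopy:
    server: str
    reachable: bool            # False if the server cannot be contacted
    activation_blocked: bool   # True if activation is administratively blocked
    copy_queue_length: int     # logs still waiting to be copied


def candidate_list(copies):
    """Drop unusable copies, then sort by copy queue length (shortest first)."""
    usable = [c for c in copies if c.reachable and not c.activation_blocked]
    return sorted(usable, key=lambda c: c.copy_queue_length)


# Hypothetical copies of one database.
copies = [
    DatabaseCopy("ExServer2", True, False, 7),
    DatabaseCopy("ExServer3", True, True, 0),
    DatabaseCopy("ExServer4", False, False, 1),
    DatabaseCopy("ExServer5", True, False, 2),
]
print([c.server for c in candidate_list(copies)])  # ['ExServer5', 'ExServer2']
```

Note that the copy on ExServer3 has the shortest queue but never reaches the list because its activation is blocked.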
Administrators are able to indicate a preference for a database copy to be activated in case of failure by setting the ActivationPreference
property on a database copy with the Set-MailboxDatabaseCopy cmdlet.
For example, in a DAG where database copies are maintained in two
datacenters, it would be normal to set the activation preference so that
the copies in the local datacenter are activated ahead of those in the
remote datacenter. In this scenario, you can run this command to assign
an activation preference value of 2 to the database copy DB1 on server
ExServer1:
Set-MailboxDatabaseCopy -Identity 'DB1\ExServer1' -ActivationPreference 2
You don’t want to get into a
situation where a database copy that is less preferred for activation is
brought online ahead of a more preferred copy, but the only way
Exchange knows your preferences for activation is if you set them. To
respect administrator choice, Active Manager sorts the list by
activation preference. Active Manager now has a list of the
available database copies ordered by activation
preference.
It’s possible that the
state of health of the copies might not be as good as we’d like, so
Active Manager now reviews the state of health of each database copy
more thoroughly using criteria such as the following:
Database status: Healthy is best because it means that the database copy is ready to go. The other status values are DisconnectedAndHealthy, DisconnectedAndResynchronizing, or SeedingSource. All of these status values indicate that more work is required to bring the database copy online.
Content index: Once again, a healthy status is best because it indicates that all of the content in the database has been indexed. A status of crawling indicates that Exchange is still indexing the content of the database copy.
Copy queue length: It is best when the queue contains fewer than 10 transaction logs, as this means that the database copy has been keeping up to date by copying logs from the active database.
Replay queue length: Once again, we want to see a moderate queue length of fewer than 50 logs awaiting replay. More than this number indicates that more work is required to bring the database copy online.
Active Manager goes through
its list sorted by activation preference and selects the first database
copy that is healthy, has a healthy content index, has a copy queue length
of fewer than 10, and has a
replay queue length of fewer than 50. If it can’t find a database copy
that matches these criteria, it reduces the standard for activation and
looks for a database copy with a content index of “Crawling.” If this
doesn’t work, it reduces its standard further until it can find a
database copy that matches. Of course, it is possible that Active
Manager will have to significantly reduce its selection standard before
it can find a matching database copy, and that copy might have a low
activation preference and be in the process of reseeding. Therefore,
it’s going to take a considerable effort to bring the copy online and
restore service.
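The idea of progressively lowering the standard can be illustrated with a short sketch. It models only the first two standards mentioned in the text (content index healthy, then crawling); the further reductions that Active Manager applies are omitted, and the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DatabaseCopy:
    server: str
    activation_preference: int
    status: str
    content_index: str
    copy_queue_length: int
    replay_queue_length: int


def standard(content_index):
    """Build one acceptance test; only the content index requirement varies."""
    def accept(c):
        return (c.status == "Healthy"
                and c.content_index == content_index
                and c.copy_queue_length < 10
                and c.replay_queue_length < 50)
    return accept


# Strictest standard first. The real process has further, lower standards.
STANDARDS = [standard("Healthy"), standard("Crawling")]


def select_copy(copies):
    """Return the first copy, in preference order, meeting the highest standard."""
    ordered = sorted(copies, key=lambda c: c.activation_preference)
    for accept in STANDARDS:
        for c in ordered:
            if accept(c):
                return c
    return None  # no copy qualifies: automatic activation fails


copies = [
    DatabaseCopy("ExServer2", 1, "Healthy", "Crawling", 2, 5),
    DatabaseCopy("ExServer3", 2, "Healthy", "Healthy", 4, 20),
]
print(select_copy(copies).server)  # ExServer3

copies[1].replay_queue_length = 80   # ExServer3 now fails both standards
print(select_copy(copies).server)  # ExServer2
```

In the first call the less preferred copy wins because it is the only one that meets the strictest standard; activation preference decides only among copies that pass the same standard.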
It’s also possible that no
database copy will meet even a reduced standard for activation. For
example, if all of the other database copies are offline, Active Manager
won’t be able to activate any database copy automatically and reports
failure. In this case, the administrator has to either fix the problem
with the original database copy and bring it back online or address the
issues that prevent Active Manager from bringing one of the other copies
online.
Note:
The full set of criteria for the standard for activation is described in “Understanding Active Manager” at http://technet.microsoft.com/en-us/library/dd776123.aspx.
Note that Microsoft tweaked the best
copy selection policy in SP1 so that if the AutoDatabaseMountDial
property of a mailbox server is set to “Lossless”, Active Manager uses the
activation preference as the primary sort key when it selects a database
copy to activate. This change reinforces the importance of activation
preference in environments where data loss is not tolerated during
failovers.
3. ACLL: Attempt copy last logs
Assuming that Active Manager
can determine a suitable database copy to activate, it instructs the
Replication Service on that server to run the ACLL
process to copy any missing transaction logs from the server that hosts
the failed database copy. Exchange 2010 obviously faces additional
complexity over the log shipping scenario in Exchange 2007 in that
copies of transaction logs that might be required to bring a database up
to date can exist in many more servers than before.
The purpose of ACLL
is to assemble all available data in the form of transaction logs to
allow the Replication Service to update the database copy before it is
mounted and made available to users. The best outcome is when all
outstanding transaction logs can be copied from the server that hosted
the failed database copy. The Replication Service can then
replay all the logs, the database copy is completely up
to date, and the failover is lossless.
Of course, the nature of
failure is that a database is most often taken offline because of a
storage outage. In this case the disk holding the transaction logs might
also be affected and the Replication Service won’t be able to copy any
logs. In this scenario, the AutoDatabaseMountDial property of the mailbox server is consulted to establish the tolerance for data loss caused by missing logs.
Warning:
The default value for the AutoDatabaseMountDial setting on an Exchange 2010 mailbox server is GoodAvailability,
meaning that the mailbox server will mount a database if up to
6 transaction logs are missing. This represents a potential data loss
of 6 MB. Much of this data is probably messages that Exchange can
recover from the transport dumpster, so it is acceptable to go ahead and
mount the database and run the risk of a small data loss. Other values
are BestAvailability, representing a tolerance of 12 transaction logs, and Lossless, meaning that no data loss is tolerated.
Exchange will not mount a database automatically if the number of missing transaction logs exceeds the limit set by AutoDatabaseMountDial.
Two actions can be taken from this point. The administrator can decide
that she wants to force the database to mount and accept whatever data
loss is incurred or she can locate the missing transaction logs from a
backup (if one is available that has these logs) or a server that hosts
another copy of the database.
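The mount decision reduces to a threshold comparison, which the following sketch illustrates. The thresholds come from the text above; the function names are hypothetical, and the 1 MB log size is an assumption implied by the equivalence drawn between missing logs and megabytes of potential data loss.

```python
# Maximum number of missing logs tolerated for each AutoDatabaseMountDial value.
MOUNT_DIAL_LIMITS = {"Lossless": 0, "GoodAvailability": 6, "BestAvailability": 12}
LOG_SIZE_MB = 1  # assumption: one transaction log is about 1 MB


def can_auto_mount(missing_logs, dial):
    """True if the number of missing logs is within the dial's tolerance."""
    return missing_logs <= MOUNT_DIAL_LIMITS[dial]


def potential_loss_mb(missing_logs):
    """Upper bound on data lost if the database mounts without these logs."""
    return missing_logs * LOG_SIZE_MB


for dial in ("Lossless", "GoodAvailability", "BestAvailability"):
    print(dial, can_auto_mount(8, dial))
# Lossless False
# GoodAvailability False
# BestAvailability True

print(potential_loss_mb(8))  # 8
```

With 8 logs missing, only a server set to BestAvailability mounts the database automatically; under the other two settings the administrator has to intervene.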
If the ACLL process
completes successfully, Active Manager has a database copy that is up to
date and ready to mount. It therefore goes ahead and issues a mount
request to bring the selected database copy online and make it available
to clients. However, further checks occur to ensure that mounting the
database will not exceed the maximum number of active databases
configured for the server and that the database copy
is not suspended for activation.
If ACLL
does not complete successfully, Active Manager selects another database
copy and starts the process again. This cycle continues until a
database copy is successfully activated and mounted or Active Manager
reaches the end of the list of available copies and has to declare
failure.
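The retry cycle can be summarized as a loop over the sorted list of candidates. This is a minimal sketch under stated assumptions: acll_succeeds and mount_allowed are hypothetical callbacks standing in for the ACLL process and the pre-mount checks, and the server names are invented.

```python
def activate_first_usable(candidates, acll_succeeds, mount_allowed):
    """Try each candidate in order; return (activated server, servers tried).

    The first element is None if every candidate fails, which corresponds
    to Active Manager declaring failure.
    """
    tried = []
    for server in candidates:
        tried.append(server)
        if acll_succeeds(server) and mount_allowed(server):
            return server, tried
    return None, tried


def acll_succeeds(server):
    # Hypothetical outcome: ACLL fails on ExServer2.
    return server != "ExServer2"


def mount_allowed(server):
    # Hypothetical outcome: ExServer3 is at its active database limit.
    return server != "ExServer3"


candidates = ["ExServer2", "ExServer3", "ExServer4"]
print(activate_first_usable(candidates, acll_succeeds, mount_allowed))
# ('ExServer4', ['ExServer2', 'ExServer3', 'ExServer4'])

print(activate_first_usable(candidates[:2], acll_succeeds, mount_allowed))
# (None, ['ExServer2', 'ExServer3'])
```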