Exchange Server 2010 : Breaking the link between database and server

8/25/2011 3:21:46 PM

Even though it was flawed in places, the introduction of continuous replication in Exchange 2007 was a big step forward to achieving in-product high availability. The final piece in the jigsaw came with removing the historical tight connection between server and mailbox database. If you look at the configuration data for Exchange 2007 in Active Directory directory service, you see a structure of Organization–Administrative Groups–Servers–Databases. In other words, databases are a child of servers and each database is owned by a server. The situation is completely different in Exchange 2010, where the structure is Organization–Database Availability Groups/Servers/Databases. Now DAGs, servers, and databases are held at the same level within the Exchange organization, and links connect these objects to establish the servers and databases that are in a DAG. Other links connect database copies to databases and the servers that host the copies. A new system management component called Active Manager uses this information to understand what database copy is currently active on what server and what available passive copies exist. The link between database and server is broken and no longer exists.
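To make the change in directory layout concrete, the following minimal Python sketch models both structures as plain dictionaries. The server, database, and DAG names and the dictionary keys are hypothetical and chosen only for illustration; they do not reproduce the real Active Directory schema.

```python
# Illustrative model only: names and keys are hypothetical, not the AD schema.

# Exchange 2007: a database is a child of the server that owns it.
exchange_2007 = {
    "Servers": {
        "ExServer1": {"Databases": ["Dublin Users"]},
        "ExServer2": {"Databases": ["Cork Users"]},
    }
}

# Exchange 2010: DAGs, servers, and databases sit at the same level,
# and links (here, lists of names) connect them.
exchange_2010 = {
    "DAGs": {"DAG1": {"members": ["ExServer1", "ExServer2"]}},
    "Servers": {"ExServer1": {}, "ExServer2": {}},
    "Databases": {
        "Dublin Users": {"copies": ["ExServer1", "ExServer2"]},
        "Cork Users": {"copies": ["ExServer2", "ExServer1"]},
    },
}


def owner_2007(config, database):
    """Find the single server that owns a database in the 2007 layout."""
    for server, data in config["Servers"].items():
        if database in data["Databases"]:
            return server
    return None


def hosts_2010(config, database):
    """List every server linked to a copy of a database in the 2010 layout."""
    return list(config["Databases"][database]["copies"])
```

In the first layout a database can only be reached through its owning server; in the second, the database is a peer object and any linked server can host a copy.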

Exchange Server 2007 also introduced the concept of database portability, which means that you can take a mailbox database from one server and mount it on another server. In Exchange 2010, Microsoft refers to this capability as database mobility. The major difference between portability and mobility is that you are not moving a database from server to server. Instead, you move the active focus for client connections and workload between copies of a database. All of the copies of a database share the same globally unique identifier (GUID) or identity, so each is able to function as the active master no matter what server it is currently located on. Database copies also must have the same path on the server for database and log files. The ability to move databases around servers within a DAG is fundamental to the ability to manage database-level failures and achieve high availability within Exchange and without recourse to third-party software as previously required.

Exchange 2010 treats a mailbox database as a unit of failover in that it can be moved between servers in a DAG as problems occur. Of course, you do not need to define DAGs within your organization, and you can run Exchange as before with databases that never move off their host server. If you elect to deploy a DAG, servers become members of the DAG and are able to host active and passive copies of mailbox databases that are hosted by the DAG. All of the services that you expect to run on a mailbox server still exist (the Information Store service, and so on) and operate on the mailbox databases that are currently hosted by a server, including running the Replication Service to process incoming copies of transaction logs to update passive copies of mailbox databases whose active copies exist on other servers elsewhere in the DAG.

INSIDE OUT: The advantages to multiple passive database copies

Just like log replication in Exchange 2007, a database only ever has one active copy at a given time, but in Exchange 2010 you can create multiple passive copies up to the number of available servers in the DAG. The more copies that exist for a database, the more likely it is that you can quickly recover from an outage that affects a server or some storage attached to a server and the less likely that your infrastructure has a single point of failure. The reduction in disk I/O in Exchange 2010 and the ability to use cheaper disk technology to host databases mean that you can afford to maintain more database copies. Of course, you have to maintain a reasonable balance here to ensure that you don’t create more database copies than you really require.

For example, in a DAG that spans 10 servers, a database can be active on one server and you can have its contents replicated to passive copies that are managed on the 9 other servers in the DAG. This is a rather extreme example and it’s more likely that databases will have three passive copies to achieve a good balance between the ability to recover from different outage scenarios and the amount of data replication that is required to keep the passive copies updated. In a situation where the DAG stretches across multiple datacenters or a need exists for a lagged copy, the number of passive copies might be increased to four. The point is that system designers now have great flexibility in terms of the way that they protect data in different circumstances.

As you would expect, the active copy of a database can be mounted or dismounted. If mounted, the database is generating transactions that Exchange replicates to the target servers that host copies of the database. Nothing much happens for dismounted databases.

Apart from its mounted state, Exchange 2010 defines a database to be either the source for replication or the target for replication. A database copy can act as the source or target but cannot function as both at the same time. In much the same manner, a database copy can be active, meaning that it is available to service incoming connections from email clients, or passive, meaning that it is available to be switched into active mode to take over service, but it cannot be both active and passive at the same time. Only one copy of a database can be active within the DAG at any time, and a server cannot host more than one copy of a database. All of this is quite logical and provides the framework within which replication and database transition from active to passive and back again occurs.
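The two placement rules in the preceding paragraph (at most one active copy per database, and at most one copy of a given database per server) can be expressed as a small validation routine. This is a sketch under stated assumptions: the `DatabaseCopy` class and the `validate_copies` function are hypothetical names invented for this illustration and are not part of Exchange.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatabaseCopy:
    """Hypothetical record for one copy of a mailbox database."""
    database: str
    server: str
    active: bool


def validate_copies(copies):
    """Return a list of rule violations for the given database copies."""
    problems = []
    seen = set()          # (database, server) pairs already encountered
    active_count = {}     # database name -> number of active copies

    for copy in copies:
        key = (copy.database, copy.server)
        if key in seen:
            problems.append(
                f"{copy.server} hosts more than one copy of {copy.database}"
            )
        seen.add(key)
        active_count.setdefault(copy.database, 0)
        if copy.active:
            active_count[copy.database] += 1

    for database, count in active_count.items():
        if count > 1:
            problems.append(f"{database} has {count} active copies")
    return problems
```

A layout with one active and one passive copy on different servers yields an empty list; adding a second active copy, or a second copy on the same server, produces one message per broken rule.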

1. Introducing Database Availability Groups

Microsoft introduced storage groups as the basis for database management in Exchange 2000. Databases fit inside storage groups, which in turn belonged to servers. All of the databases in a storage group shared a common set of transaction logs, and transactions from all of the databases in the storage group were interleaved in the transaction logs. From Exchange 2003 onward, if you had a problem with a database, you could use the Recovery Storage Group to access a recovered copy of a database and retrieve mailbox data. However, although it was sometimes convenient to use storage groups for management, Microsoft eventually determined that they introduced an extra layer of complication for administrators. The process of backing storage groups out of the product began in Exchange 2007, in which the continuous log replication feature worked only for storage groups that hosted just one database. You could still put more databases in a storage group, including public folder databases, but Microsoft gave a clear indication that it preferred single-database storage groups, especially if you wanted to exploit its investment in Exchange’s high-availability features. It therefore comes as no surprise that storage groups have disappeared in Exchange 2010.

Removing storage groups simplifies administration but doesn’t help with high availability. Log replication in Exchange 2007 helps to deliver more highly available messaging, but it is limited to one source server and one target server. Exchange 2007 CCR and SCR deployments proved that the underlying mechanism works: transaction logs are shipped to a target system, where the Replication Service replays their contents into a passive copy of the source database so that the copy stays current and is ready to take over if the original database runs into problems. Exchange 2010 builds on Exchange 2007 to allow a single database to have multiple copies that are tied together into a new structure called a Database Availability Group (DAG). The term is new to Exchange and it has an unfortunate definition in other places. Wikipedia mentions that a dag is a clump of dung stuck to the wool of a sheep.

Fundamentally, a DAG is a collection of databases and copies that are shared across up to 16 servers. Although 16 might seem an arbitrary figure to use as the limit of the servers that a DAG can support, in fact, it’s a limit imposed by Windows Failover Clustering, which can only support 16 nodes in its clusters. As Windows Failover Clustering underpins the DAG, the restriction flows through to the DAG.

In any case, 16 seems like a number that should be sufficient for most deployments and is certainly enough to explore just how far Microsoft can push the envelope for the combination of technologies that constitute a DAG. These include the following:

Log replication (the technology to implement transfer and replay plus the network load to support replication)
Networks
Windows Failover Clustering
Monitoring and management tools
Server and storage hardware

After all, if we look at the previous generation of single-copy clusters, aside from Microsoft’s own internal implementation, relatively few customers ever went past four servers in a cluster so there is no obvious demand for megaclusters spanning hundreds of servers. This is the first implementation of the DAG and it is possible that Microsoft will consider whether it is feasible and advantageous to increase the number of servers that a single DAG can support in future versions of Exchange or to leverage new features delivered in a new version of Windows. For now, we remain at 16. The simple answer of putting individual servers in multiple DAGs is available if you need to protect databases on more than 16 servers.
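The arithmetic behind the 16-server limit is simple, and a short sketch makes the planning consequence explicit. The function names below are hypothetical; the only figure taken from the text is the limit of 16 members per DAG.

```python
import math

# Limit stated in the text: a DAG can contain at most 16 servers.
MAX_DAG_MEMBERS = 16


def dags_required(mailbox_servers):
    """Minimum number of DAGs needed to hold the given number of servers."""
    if mailbox_servers < 0:
        raise ValueError("server count cannot be negative")
    return math.ceil(mailbox_servers / MAX_DAG_MEMBERS)


def max_passive_copies(dag_members):
    """Upper bound on passive copies of one database in a DAG of this size.

    One member holds the active copy, and a server cannot host more than
    one copy of the same database, so the bound is members minus one.
    """
    if not 1 <= dag_members <= MAX_DAG_MEMBERS:
        raise ValueError("a DAG has between 1 and 16 members")
    return dag_members - 1
```

For instance, 40 mailbox servers need at least three DAGs, and a database in a 10-server DAG can have at most nine passive copies, matching the example given earlier.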

The DAG implements the concept of an active (or primary) database, the one to which users currently connect, and its copies on other servers that can be swapped into place to become the active database. The database copies are kept updated through log replication and replay. If a problem occurs on a server that renders the databases running on the server inaccessible, the DAG can activate a copy and make it the active copy. The new remote procedure call (RPC) Client Access Layer redirects client connections seamlessly to the newly activated copy. Databases cannot be swapped around between servers if the product architecture demands that databases are firmly attached to servers, which is the traditional approach taken by all previous versions of Exchange. The introduction of the DAG smashes the link between a database and the owning server to make portable databases the basic building block for high availability in Exchange 2010. This is probably the most fundamental architectural change that Microsoft makes in Exchange 2010. The servers within a DAG can support other roles, but each DAG member must have the mailbox role installed because it has to be able to host a mailbox database.

The servers in a DAG can be on different subnets and span different Active Directory sites as long as the underlying network infrastructure supports sufficient bandwidth to transfer the expected volume of transaction log files between the different servers. To ensure smooth operation, Microsoft recommends that mailbox servers in a DAG are connected with a network that accommodates a round-trip latency of 250 milliseconds or less. In addition, Microsoft recommends that you block cross-network traffic between the datacenters to avoid excessive heartbeat traffic across the cluster and so conserve available bandwidth for more important activities such as log replication. There are other issues to consider when a DAG stretches across two datacenters, including the assignment of IP addresses for the networks, using appropriate Domain Name System (DNS) time to live (TTL) settings to ensure that clients pick up network changes quickly in the event of failovers, and the provision of suitable names for all of the services offered to clients by Exchange from both datacenters.

An Exchange 2010 server running the enterprise edition can support up to 100 databases. This limit covers the active databases on the server together with any passive database copies that it hosts for other servers. Even though Exchange 2007 servers include an earlier version of log replication technology, you cannot include Exchange 2007 servers within a DAG.

Note:

Messaging Application Programming Interface (MAPI) clients connecting to servers running Exchange 2007 or an earlier version know that their endpoint is a mailbox in a database on a particular server. The msExchHomeServerName property of the user’s Active Directory account points to the server and the homeMDB property points to the database. Exchange 2010 breaks the connection between server and database, so a new scheme is required to determine the endpoint for MAPI clients. Exchange 2010 does not use the msExchHomeServerName value because it would be too expensive to apply updates to potentially thousands of Active Directory objects during a database transition within a DAG. Instead, the following algorithm is used:

Fetch the homeMDB property for the user object in Active Directory. The full value of the property is shown here. The first CN in the property (“Dublin Users”) identifies the database that holds the user’s mailbox.
```
CN=Dublin Users,CN=Databases,CN=Exchange Administrative Group
(FYDIBOHF23SPDLT),CN=Administrative Groups,CN=contoso,
CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=contoso,DC=com
```
Truncate the value of the database’s legacyExchangeDN property to determine the server legacyExchangeDN. This isn’t the actual server that hosts the database. Instead, it is the Client Access Server (CAS) that currently provides the RPC Client Access service for the database.
```
/o=contoso/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)
/cn=Configuration/cn=Servers/cn=ExServer2/cn=Microsoft Private MDB
```
Use the server legacyExchangeDN to determine where the active copy of the mailbox database (MDB) is currently mounted. This value is available in the msExchOwningServer property of the database object.
Connect to the MAPI endpoint provided by the CAS. The CAS will handle communications with the mailbox server.

This is just one example of how the code base in Exchange has changed to accommodate the introduction of the DAG.
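The string handling in the first two steps of the lookup described in the note can be sketched in a few lines of Python. This is an illustration only, not how Exchange implements the lookup: the function names are hypothetical, the sample values are the ones shown in the note with their wrapped lines joined, and the parsing assumes that the first component of the distinguished name contains no escaped commas.

```python
# Sample values copied from the note (wrapped lines joined).
HOME_MDB = (
    "CN=Dublin Users,CN=Databases,"
    "CN=Exchange Administrative Group (FYDIBOHF23SPDLT),"
    "CN=Administrative Groups,CN=contoso,CN=Microsoft Exchange,"
    "CN=Services,CN=Configuration,DC=contoso,DC=com"
)
DATABASE_LEGACY_DN = (
    "/o=contoso/ou=Exchange Administrative Group (FYDIBOHF23SPDLT)"
    "/cn=Configuration/cn=Servers/cn=ExServer2/cn=Microsoft Private MDB"
)


def database_name_from_home_mdb(home_mdb):
    """Return the value of the first CN component of a homeMDB value.

    Simplification: assumes the first component has no escaped commas.
    """
    first_component = home_mdb.split(",", 1)[0]
    attribute, _, value = first_component.partition("=")
    if attribute.strip().upper() != "CN" or not value:
        raise ValueError("homeMDB does not start with a CN component")
    return value


def server_legacy_dn(database_legacy_dn):
    """Truncate a database legacyExchangeDN by dropping its last component."""
    server_dn, separator, _last = database_legacy_dn.rpartition("/")
    if not separator or not server_dn:
        raise ValueError("legacyExchangeDN has no component to truncate")
    return server_dn
```

Applied to the sample values, the first function yields the database name `Dublin Users` and the second yields a server legacyExchangeDN ending in `/cn=ExServer2`.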

2. The dependency on Windows clustering

Under the hood, the DAG uses Windows Failover Clustering technology to manage server membership within the DAG, to monitor server heartbeats to know what servers in the DAG are healthy, and to maintain a quorum. The big difference from clustering as implemented in other versions of Exchange is that there is no concept of an Exchange virtual server or a clustered mailbox server, nor are there any cluster resources allocated to Exchange apart from an IP address and network name. Windows Failover Clustering uses the network name to update the password for the Computer Name Object for the cluster. The Primary Active Manager (see the next section) also uses the list of possible owners of the cluster’s File Share Witness (FSW) resource as the candidate servers when it needs to transition due to a server outage. However, apart from these small points, there’s no practical dependency on the network name as Exchange reverts to server names if the cluster name is unavailable. In fact, you never need to manage cluster resources such as nodes, network, or storage using the Windows Failover Cluster Manager because everything is managed through Exchange, and if you attempt to change the cluster settings with Cluster Manager, there’s a fair chance that you could end up breaking something on which Exchange depends. In effect, Exchange provides a blanket that hides the complexity of cluster technology from system administrators.

INSIDE OUT: Operating system requirements

Even though Exchange uses a bare minimum of cluster technology, the dependency on Windows clustering means that you can only add mailbox servers to a DAG if they are running on Microsoft Windows Server 2008 (SP2 or R2) Enterprise edition. It also means that all of the DAG member servers must be part of the same domain. You should also run the same version of the operating system on all of the DAG member servers; you definitely cannot mix Windows Server 2008 SP2 and Windows Server 2008 R2, and it just makes good sense to keep the servers at the same software level. Exchange 2010 SP1 demonstrates that it is wise to keep all software at the same revisions on all DAG members; you can activate a database copy by moving it from a server running the RTM version of Exchange 2010 to one running SP1, but you cannot perform the reverse operation and move the active database copy from a server running SP1 to one running RTM.
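The one-way nature of moves between RTM and SP1 servers can be captured in a small check. This is a sketch resting on an assumption: the text only states that a move from RTM to SP1 works and a move from SP1 to RTM does not, and the code generalizes this to "the target must be at the same or a later revision". The names `VERSION_ORDER` and `can_move_active_copy` are hypothetical.

```python
# Assumed ordering of the two Exchange 2010 revisions named in the text.
VERSION_ORDER = {"RTM": 0, "SP1": 1}


def can_move_active_copy(source_version, target_version):
    """Return True if the active copy may move from source to target.

    Assumption: a move is allowed only when the target server runs the
    same or a later revision than the source server.
    """
    return VERSION_ORDER[target_version] >= VERSION_ORDER[source_version]
```

Under this assumption, `can_move_active_copy("RTM", "SP1")` is true while `can_move_active_copy("SP1", "RTM")` is false, which is why a mixed-revision DAG should be a short-lived state during an upgrade.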

Despite its name, Windows Failover Clustering takes no part in the failover of Exchange mailbox databases. This functionality is provided by a new system management component within Exchange called the Active Manager, which maintains visibility of server conditions and the current state of databases, and is responsible for instructing servers to move database copies from active to passive and passive to active as required. Even better, an Exchange administrator doesn’t have to be concerned with the complexities of Windows clustering, as Exchange configures the limited clustering features that it needs (cluster heartbeat and quorum) when it adds the first mailbox server to a DAG. Some of the information relating to the DAG is held in Active Directory. This information tends to be static and doesn’t change very often, such as the name of the DAG. Other information that is more dynamic and prone to change quickly, such as database mount status (active or passive), is held in the cluster database.