As a cluster-aware
application, a SQL Server instance can be installed into an existing
Windows cluster, creating what's referred to as a failover clustering instance. Once installed, the instance is accessed using a network name (aka virtual server name) without needing to know which of the underlying physical cluster servers the instance is currently running on.
The abstraction
between a physical and a virtual server is a key component in providing
continued database availability after a failure event. In the event of a
server failure, the SQL Server instance automatically moves, or fails over, to another cluster server, while continuing to be accessed using the same virtual server name.
In this section, we'll
take a high-level look at clustering architecture, including its
benefits, limitations, and common usage scenarios.
1. Clustering architecture
Built on top of a Windows Server failover cluster, SQL Server clustering uses the shared nothing architecture, a term used to distinguish its clustering implementation from those of other database vendors, such as Oracle's Real Application Clusters (RAC). Under the shared nothing architecture, a SQL Server database
instance is active on only one physical clustered server at any given
time.
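Because the instance is active on only one node at a time, it can be useful to see which cluster nodes are available to host it. As a minimal sketch (assuming a SQL Server 2005 or later failover clustering instance), the sys.dm_os_cluster_nodes dynamic management view lists the nodes of the underlying Windows cluster; it returns no rows on a nonclustered instance:

-- List the cluster nodes on which this SQL Server instance can run
SELECT NodeName
FROM sys.dm_os_cluster_nodes;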
Unlike a network
load-balanced solution, SQL Server failover clustering isn't a
high-performance solution; in other words, the performance won't be any
better, or worse, than a nonclustered implementation. SQL Server
clustering is purely a high-availability solution and you should not
choose it for any other reason.
Figure 1
illustrates a simple example of a two-node cluster with a single SQL
Server failover clustering instance. Ordinarily, the instance resides on
the Sales1 server. In the event of a failure of this server, the SQL Server instance automatically fails over to the Sales2 server. In either case, the SQL instance will continue to be accessed using the same virtual server name.
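We can observe this abstraction from within the instance itself. As a hedged example (assuming SQL Server 2005 or later), the following query contrasts the stable network name with the physical node currently hosting the instance; run before and after a failover, NetworkName stays the same while PhysicalNode changes from Sales1 to Sales2:

-- Network (virtual server) name vs. the physical node currently hosting the instance
SELECT SERVERPROPERTY('MachineName') AS NetworkName,
       SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS PhysicalNode;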
Key components of a Windows Server failover cluster solution are shared storage and resource arbitration. In figure 1,
we can see both the Sales1 and Sales2 servers have access to shared
storage. The databases contained in the SQL instance reside on the
shared storage to enable either server to access them when required. But
in order to prevent both servers from accessing them at the same time,
and therefore causing data corruption, the Windows clustering service
arbitrates ownership of the disk volumes based on the server's current
role in the cluster.
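SQL Server exposes its view of this shared storage. As a sketch (again assuming SQL Server 2005 or later), the sys.dm_io_cluster_shared_drives view lists the shared drives the clustered instance recognizes; ownership of these drives moves with the instance during failover:

-- Shared cluster drives visible to this clustered instance
SELECT DriveName
FROM sys.dm_io_cluster_shared_drives;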
Clusters consist of one or more resource groups,
which are collections of resources required by a clustered application.
In the case of a clustered SQL Server instance, its resource group
would include the disks containing the database instance's data and
transaction logs, an IP address and network name, and the three application services: SQL Server, SQL Agent, and the full-text service. Failover
is a term used to describe the transfer of ownership of resources from
one server to another; it occurs at the resource group level, ensuring
that all resources required for the application are moved and available
on the failover server.
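Before reasoning about failover behavior, it's worth confirming that an instance is clustered at all. A minimal check, using the IsClustered server property (available in SQL Server 2000 and later):

-- Returns 1 for a failover clustering instance, 0 otherwise
SELECT SERVERPROPERTY('IsClustered') AS IsClustered;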
Given that a database
instance can be owned by various servers in the cluster, applications
are configured to connect to the virtual server name rather than the
name of a physical cluster server. To avoid confusion with the Microsoft Virtual Server product, SQL Server 2005 and above use the term failover clustering instance in place of virtual server, the term used in SQL Server 2000 and earlier.
The terms and concepts we've covered thus far will become
clearer, particularly when we walk through an installation of a
clustered instance. For now, let's turn our attention to the major
advantages and limitations of SQL Server clustering.
2. SQL Server clustering advantages and limitations
As a high-availability
technology, clustering has a number of advantages and limitations when
compared to other SQL Server high-availability options such as database mirroring and transaction log shipping. Its primary advantage is that in the event of a server failure, the entire instance and all of its databases are moved to a failover server, a process that usually takes no more than about 90 seconds. This stands in contrast to mirroring and log shipping solutions, which are established on a database-by-database basis.
In addition to providing
protection from unexpected server failure, the other major benefit of
clustering is the ability to reduce downtime during planned outages. For
example, consider a requirement to upgrade a server's RAM. Unless we
have a server with hot-add RAM capabilities, we'd need to power off the
server, upgrade the RAM, and power back on again, and during this time
the database instance would be unavailable. In a clustering scenario, we could manually initiate a failover, which would move the instance to the other server, enabling the upgrade to occur while the instance remains available. Thus, the downtime is limited to the failover time, typically about 1 minute.
The major limitation of clustering, particularly in Windows Server 2003 and earlier, is that unlike database mirroring and transaction log shipping, it provides no protection (other than a RAID solution) from failure of the disks containing the database files and/or the cluster quorum resource, discussed shortly. Further, typical clustering solutions don't offer
protection from geographic disasters—that is, if all clustered servers
exist in a single location, the destruction of that location will result
in an outage. Although multisite clusters and the new Windows Server
2008 quorum models (discussed shortly) address these limitations, it's a
common practice to combine multiple high-availability solutions to
deliver the advantages of each while minimizing the individual
limitations. For example, important databases within a clustered SQL
Server instance can be mirrored to an offsite location for geographic
(and disk) protection.
3. Clustering in Windows Server 2008
A SQL Server
failover clustering instance is installed on, and in some ways is
constrained by, the underlying Windows cluster. Although Windows Server
2003 and earlier had a solid and full-featured clustering solution,
there were a number of limitations that constrained the configuration of
the installed SQL Server failover clustering instances.
The SQL Server 2008
release was preceded by the release of Windows Server 2008, bringing
with it a substantial improvement in the clustering options. In
comparison to Windows Server 2003, the clustering improvements in
Windows Server 2008 include the following:
An enhanced validation test, as shown in figure 2, which can be used to ensure the validity of the hardware and software components in forming a cluster
Support for IPv6 and up to 16 cluster servers (increased from 8)
The ability of cluster servers to obtain IP addresses via DHCP
Support for new quorum models, used to determine the number of failures a cluster can sustain while still operating
Apart from the enhanced validation and management tools, the increase in supported server nodes and the relaxed networking restrictions enable the creation of
much more flexible clustering solutions, making Windows Server 2008 the
best server operating system for a clustered SQL Server 2008
deployment. Further enhancing its appeal are the improved quorum models.
Cluster hardware
should be certified by both the hardware vendor and Microsoft as cluster
compatible. With the release of Windows Server 2008, Microsoft
announced the Failover Cluster Configuration Program
(FCCP). Hardware vendors will certify complete cluster configurations
against this program, making the process of choosing a
cluster-compatible hardware solution much simpler than in the past.
4. Quorum models
A fundamental aspect of
Windows Server clustering is the process whereby each node is assigned
unambiguous ownership of a resource. For example, consider a two-node
cluster like the one you saw in figure 1.
If the network link between the two cluster nodes temporarily drops,
what process prevents both servers from assuming that they're now the
owner of the SQL instance? The process used to resolve such an occurrence (commonly called the split brain problem) is referred to as the cluster quorum.
In earlier versions
of Windows clustering (prior to Windows Server 2003), cluster quorum
was maintained using a shared disk resource and a quorum database
containing the resources and owners. In our previous example, a link
failure between the two nodes would be resolved by only one of the nodes
having ownership of the quorum disk. When the link is dropped, the node
with quorum disk ownership continues its role and takes on the roles
(if any) of the other node.
Despite the simplicity
and effectiveness of this quorum model, the quorum disk was a single
point of failure. Further, given the need for all cluster nodes to have
access to the shared disk resource containing the quorum disk, the
constraints of typical shared storage hardware prevented the creation of
geographically dispersed clusters. Windows Server 2003 addressed this
with the introduction of the Majority Node Set (MNS) quorum.
Nodes in an MNS quorum
cluster operate with local copies of the quorum database, avoiding the
limitations of shared storage, but in order to prevent the split brain
problem, a majority of nodes must remain in contact for the cluster to
be considered valid. One of the key attributes of an MNS cluster is the
requirement for a minimum of three nodes to form a valid cluster, and in
the case of a five-node cluster, a majority of the nodes (three) would
need to remain in contact for the cluster to continue operating.
Windows Server 2008 further enhances the quorum model with a voting system. As you can see in figure 3, a number of options are available, with a recommended option based on the number of cluster nodes.
In summary, the following general recommendations apply to the selection of a quorum model in Windows Server 2008:
Node Majority—Used for clusters with an odd number of nodes
Node and Disk Majority—Used for clusters with shared storage and an even number of cluster nodes
Node and File Share Majority—Designed for multisite and Microsoft Exchange clusters
No Majority—Compatible with earlier quorum models; isn't recommended given the single point of failure on the quorum disk
Before
we look at some of the more important preinstallation configuration
tasks, let's take a moment to dig a little deeper and review common
clustering topologies.