If an NLB cluster is too
limited in functionality for you, you should investigate a true server
cluster. In a true server cluster, a group of machines shares a single
identity and works in tandem to manage applications and, in the event of
a failure, migrate them away from problematic nodes and onto functional
nodes. The nodes of the cluster use a common, shared resource database
and log storage facility provided by a physical storage device that is
located on a hardware bus shared by all members of the cluster.
The
shared data facility does not support IDE disks, software RAID
(including Windows-based dynamic RAID), dynamic disks or volumes, the
EFS, mounted volumes and reparse points, or remote storage devices such
as tape backup drives.
Three types of clusters
are supported by Windows Server 2003 in the Enterprise and Datacenter
editions of the product: single node clusters, which are useful in test
and laboratory environments to see if applications and resources
function in the manner intended but do not have any sort of
fault-tolerant functionality; single quorum device clusters, which are
the most common and most functional type of cluster used in production
because of their multiple nodes; and majority node set clusters, which
function as a cluster but without a shared physical storage device,
something required of the other two types. Majority node set clusters
are useful if you do not have a SCSI-based SAN or if the members of a
cluster are spread out over several different sites, making a shared
storage bus unfeasible. Both the Enterprise and Datacenter Editions
support up to eight cluster nodes.
Clusters manage failure
using failover and failback policies (that is, unless you are using a
single node cluster). Failover policies dictate the behavior of cluster
resources
when a failure occurs—which nodes
the failed resources can migrate to, the timing of a failover after the
failure, and other properties. A failback policy specifies what will
happen when the failed node comes back online again. How quickly should
the migrated resources and applications be returned to the original
node? Should the migrated objects stay at their new home? Should the
repaired node be ignored? You can specify all of this behavior through
policies.
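To make these policy knobs concrete, here's a rough sketch in Python of what a failover and failback policy boils down to for a set of resources. The class and field names (FailoverPolicy, failover_threshold, window_start_hour, and so on) are my own invention for illustration, not anything exposed by the cluster service itself.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class FailoverPolicy:
        """Hypothetical model of the failover settings described above."""
        preferred_nodes: List[str]        # nodes the failed resources may migrate to, in order
        failover_threshold: int = 3       # how many failovers are tolerated...
        failover_period_hours: int = 6    # ...within this window before giving up

    @dataclass
    class FailbackPolicy:
        """Hypothetical model of what happens when the failed node returns."""
        allow_failback: bool = False              # False: resources stay at their new home
        window_start_hour: Optional[int] = None   # e.g., 1 to defer failback until 01:00
        window_end_hour: Optional[int] = None     # e.g., 5 to finish by 05:00

    # Example: a set of resources that may move to NODE2 or NODE3 and fails
    # back to its original node only during the early-morning window.
    group_policy = (
        FailoverPolicy(preferred_nodes=["NODE2", "NODE3"]),
        FailbackPolicy(allow_failback=True, window_start_hour=1, window_end_hour=5),
    )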
1. Cluster Terminology
A few specific terms have special meanings when used in the context of clustering. They include the following:
Networks Networks, also called interconnects, are the ways in which
cluster nodes communicate with one another and with the public
network. The network is the most common point of failure in cluster
nodes; always make network cards redundant in a true server cluster.
Nodes Nodes are the
actual members of the cluster. The clustering service supports only
member nodes running Windows Server 2003 Enterprise Edition or
Datacenter Edition. Other requirements include the TCP/IP protocol,
connection to a shared storage device, and at least one interconnect to
other nodes.
Resources Resources are
simply anything that can be managed by the cluster service and that the
cluster can use to provide a service to clients. Resources can be
logical or physical and can represent real devices, network services, or
file system objects. A special type of physical disk resource called
the quorum disk
provides a place for the cluster service to store recovery logs and its
own database. I'll provide a list of some resources in the next
section.
Groups Resources can be collected into resource groups, which are simply units by which failover and failback policy can be
specified. A group's resources all fail over and fail back according to a
policy applied to the group, and all the resources move to other nodes
together upon a failure.
Quorum A quorum is
the shared storage facility that keeps the cluster resource database and
logs. As noted earlier in this section, this needs to be a SCSI-based
real drive with no special software features.
2. Types of Resources
A variety of resources are
supported out of the box by the clustering service in Windows Server
2003. They include the following:
DHCP This type of
resource manages the DHCP service, which can be used in a cluster to
assure availability to client computers. The DHCP database must reside
on the shared cluster storage device, otherwise known as the quorum
disk.
File Share Shares on servers can be made redundant and fault-tolerant, without relying on the Dfs service,
by using the File Share resource inside a cluster. You can put shared
files and folders into a cluster as a standard file share with only one
level of folder visibility, as a shared subfolder system with the root
folder and all immediate subfolders shared with distinct names, or as a
standalone Dfs root.
Fault-tolerant Dfs roots cannot be placed within a cluster.
Generic Application Applications that
are not cluster-aware (meaning they don't have their own fault tolerance
features that can hook into the cluster service) can be managed within a
cluster using the Generic Application resource. Applications managed in
this way must be able to store any data they create in a custom location,
use TCP/IP to connect clients, and be able to receive clients attempting
to reconnect in the event of a failure. You can install a
cluster-unaware application onto the shared cluster storage device; that
way, you need to install the program only once and then the entire
cluster can use it.
Generic Script This resource type is
used to manage operating system scripts. You can cluster login scripts
and account provisioning scripts, for example, if you regularly use
those functions and need their continued availability even in the event
of a machine failure. Hotmail's account provisioning functions, for
instance, are a good fit for this feature, so users can sign up for the
service at all hours of the day.
Generic Service You can manage
Windows Server 2003 core services, if you require them to be highly
available, using the Generic Service resource type. Only the bundled
services are supported.
IP Address The IP Address resource manages a static, dedicated IP address assigned to a cluster.
Local Quorum This type of resource represents a quorum kept on a node's
local disk rather than on shared storage; a single node cluster uses it
to hold its activity logs and cluster resource database. Local quorums do
not have failover capabilities.
Majority Node Set The Majority Node
Set resource represents cluster configurations that don't reside on a
quorum disk. Because there is no quorum disk, particularly in instances
where the nodes of a cluster are in separate, geographically distinct
sites, there needs to be a mechanism by which the cluster nodes can stay
updated on the cluster configuration and the logs each node creates.
Only one Majority Node Set resource can be present within each cluster
as a whole. With a majority node set, you need a majority of the nodes
functioning, that is (n/2) + 1 rounded down, for the cluster to be
online, so if you have four members of the cluster, three must be
functioning (see the sketch following this list).
Network Name The Network Name
resource represents the shared DNS or NetBIOS name of the cluster, an
application, or a virtual server contained within the cluster.
Physical Disk Physical Disk
resources manage storage devices that are shared to all cluster members.
The drive letter assigned to the physical device is the same on all
cluster nodes. The Physical Disk Resource is required by default for all
cluster types except the Majority Node Set.
Print Spooler Print
services can be clustered using the Print Spooler resource. This
represents printers attached directly to the network, not printers
attached directly to a cluster node's ports. Printers that are clustered
appear normally to clients, but in the event that one node fails, print
jobs on that node will be moved to another, functional node and then
restarted. Clients that are sending print jobs to the queue when a
failure occurs will be notified of the failure and asked to resubmit
their print jobs.
Volume Shadow Copy Service Task This resource
type is used to create shadow copy jobs in the Scheduled Task folder on
the node that currently owns the specified resource group hosting that
resource. You can use this resource only to provide fault tolerance for
the shadow copy process.
WINS The WINS resource
type is associated with the Windows Internet Naming Service, which maps
NetBIOS computer names to IP addresses. To use WINS and make it a
clustered service, the WINS database needs to reside on the quorum disk.
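Here's a minimal Python sketch of the majority node set arithmetic mentioned in the list above; the function names (majority_needed, has_quorum) are mine, purely for illustration.

    def majority_needed(total_nodes: int) -> int:
        """Nodes that must be functioning for a majority node set
        cluster to stay online: (n / 2) + 1, rounded down."""
        return total_nodes // 2 + 1

    def has_quorum(total_nodes: int, functioning_nodes: int) -> bool:
        """True if enough nodes are up for the cluster to be online."""
        return functioning_nodes >= majority_needed(total_nodes)

    # A four-node majority node set cluster needs three functioning nodes,
    # while a three-node cluster can survive the loss of one node.
    assert majority_needed(4) == 3 and has_quorum(4, 3)
    assert majority_needed(3) == 2 and has_quorum(3, 2)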
3. Planning a Cluster Setup
Setting up a server
cluster can be tricky, but you can take a lot of the guesswork out of
the process by having a clear plan of exactly what goals you are
attempting to accomplish by having a cluster. Are you interested in
achieving fault tolerance and load balancing at the same time? Do you
not care about balancing load but want your focus to be entirely on
providing five-nines service? Or would you like to provide only critical
fault tolerance and thereby reduce the expense involved in creating and
deploying the cluster?
If you are interested
in a balance between load balancing and high availability, you allow
applications and resources in the cluster to "fail over," or migrate, to
other nodes in the cluster in the event of a failure. The benefit is
that they continue to operate and are accessible to clients, but they
also increase the load among the remaining, functioning nodes of the
cluster. This load can cause cascading failures—as nodes continually
fail, the load on the remaining nodes increases to the point where their
hardware or software is unable to handle the load, causing those nodes
to fail, and the process continues until all nodes are dead—and that
eventuality really makes your fault-tolerant cluster immaterial. The
moral here is that you need to examine your application and plan each
node appropriately to handle an average load plus an "emergency reserve"
that can absorb increased loads in the event of a failure. You also
should have policies and procedures to manage loads quickly when nodes
fail. This setup is shown in Figure 1.
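A quick back-of-the-envelope calculation helps size that emergency reserve. The Python sketch below assumes the failed node's load is spread evenly across the survivors; the numbers and names (per_node_capacity and friends) are made up for illustration.

    def load_per_survivor(total_load: float, nodes: int, failed: int) -> float:
        """Average load each surviving node carries if 'failed' nodes go
        down and their work is redistributed evenly."""
        survivors = nodes - failed
        if survivors <= 0:
            raise ValueError("no surviving nodes")
        return total_load / survivors

    # Example: four nodes, each sized for 100 units, carrying 240 units total.
    per_node_capacity = 100.0
    total_load = 240.0
    for failed in range(4):
        load = load_per_survivor(total_load, nodes=4, failed=failed)
        status = "OK" if load <= per_node_capacity else "overloaded"
        print(f"{failed} failed node(s): {load:.0f} units per survivor ({status})")
    # Losing one node is fine (80 units each), but losing two pushes each
    # survivor to 120 units, which is where a cascading failure begins.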
If your be-all and end-all
goal is true high availability, consider running a cluster member as a
hot spare, ready to take over operations if a node fails. In this case,
you would specify that if you had n cluster nodes, the applications and resources in the cluster should run on n-1
nodes. Then, configure the one remaining node to be idle. In this
fashion, when failures occur the applications will migrate to the idle
node and continue functioning. A nice feature is that your hot spare
node can change, meaning there's not necessarily a need to migrate
failed-over processes to the previously failed node when it comes back
up—it can remain idle as the new hot spare. This reduces your management
responsibility a bit. This setup is shown in Figure 2.
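The bookkeeping behind a floating hot spare is simple enough to show in a few lines of Python. This is a toy model with invented names (HotSpareCluster and so on), not how the cluster service is actually implemented, but it captures the idea: workloads run on n-1 nodes, the spare absorbs a failure, and a repaired node simply becomes the new spare.

    class HotSpareCluster:
        """Toy model of an n-node cluster that always keeps one node idle."""

        def __init__(self, nodes):
            *active, spare = nodes     # run workloads on n-1 nodes, keep one idle
            self.active = list(active)
            self.spare = spare

        def node_failed(self, node):
            """The idle spare takes over for the failed node."""
            self.active.remove(node)
            self.active.append(self.spare)
            self.spare = None          # no spare until a node is repaired

        def node_repaired(self, node):
            """The repaired node becomes the new hot spare; nothing migrates back."""
            self.spare = node

    cluster = HotSpareCluster(["NODE1", "NODE2", "NODE3", "NODE4"])
    cluster.node_failed("NODE2")      # NODE4, the spare, picks up NODE2's work
    cluster.node_repaired("NODE2")    # NODE2 now sits idle as the new spare
    print(cluster.active, cluster.spare)   # ['NODE1', 'NODE3', 'NODE4'] NODE2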
Also consider a
load-shedding setup. In load shedding, you specify a certain set of
resources or applications as "critical" and those are automatically
failed over when one of your cluster nodes breaks. However, you also
specify another set of applications and resources as "non-critical."
These do not fail over. This type of setup helps prevent cascading
failures when load is migrated between cluster nodes because you shed
some of the processing time requirements in allowing non-critical
applications and resources to simply fail. Once repairs have been made
to the nonfunctional cluster node, you can bring up the non-critical
applications and resources and the situation will return to normal. This
setup is shown in Figure 3.
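Put another way, load shedding is just a filter applied at failover time: only the resources you have flagged as critical get restarted elsewhere. The short Python sketch below uses an invented critical flag and made-up resource names to illustrate the split.

    # Hypothetical resources on the failed node: (name, critical?)
    resources_on_failed_node = [
        ("SQL-Data",      True),    # must remain available
        ("Payroll-Share", True),
        ("Test-WebApp",   False),   # acceptable to lose until repairs are made
        ("Build-Scripts", False),
    ]

    # Only critical resources fail over; the rest are shed so the surviving
    # nodes do not inherit the failed node's entire load.
    fail_over = [name for name, critical in resources_on_failed_node if critical]
    shed = [name for name, critical in resources_on_failed_node if not critical]

    print("Failing over:", fail_over)   # ['SQL-Data', 'Payroll-Share']
    print("Shedding:", shed)            # ['Test-WebApp', 'Build-Scripts']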