1. Redundancy
To understand where redundancy might be needed, you have to
identify every potential point of failure in the deployment. Even though
it sounds easy—not to mention a tad tedious and boring—it requires some
imagination to ensure that you really have found them all. Switches,
routers, network cards, and even network cables are single points of
failure. Outside of your architecture, but no less important, are power
sources and physical facilities. But what about services needed to keep the deployment up? What if all network management is consolidated in a single web-based interface? Or what if only one person on staff knows how to handle certain types of failure?
Identifying the points of failure does not necessarily mean that you
have to eliminate them all. Sometimes it is simply not possible for economic, technical, or geographic reasons, but being aware of them helps you with planning.
Some things that you should consider, or at least make a conscious decision about whether to consider, are the cost of duplicating components, the probability of failure for different components, the time needed to replace a component, and the risk exposure while repairing a component. If repairing a
component takes a week and you are running with the spare as the single
point of failure during this time, you are taking a certain risk that the
spare could be lost as well, which may or may not be acceptable.
Once you have identified where you need redundancy, you have to
choose between two fundamental alternatives: you can either keep
duplicates around for each component—ready to take over immediately if the
original component should fail—or you can ensure you have extra capacity
in the system so that if a component fails, you can still handle the load.
This choice does not have to be made in an all-or-nothing fashion: you can
combine the two techniques so that you duplicate some components and use
extra capacity for some other parts of the system.
On the surface, the easiest approach is to duplicate components, but
duplication is expensive. You have to leave a standby around and keep it
up-to-date with the main component all the time. The advantages of
duplicating components are that you do not lose performance when switching
and that switching to the standby is usually faster than restructuring the
system, which you would have to do if you approached the problem by
creating spare capacity.
Creating spare capacity lets you use all the components for running
the business, possibly allowing you to handle higher peaks in your load.
When a component breaks, you restructure the system so that all remaining
components are in use. It is, however, important to have more capacity
than you normally need.
To understand why, consider a simple case where you have a master
that handles the writes—actually, you should have two, since you need to
have redundancy—with a set of slaves connected to the master whose only
purpose is to serve read requests.
Should one of the slaves fail, the system will still respond, but its capacity will be reduced. If you have 10 slaves, each running at 50 percent capacity, the failure of one slave increases the load on each remaining slave to about 56 percent, which is easy to handle. However, if the slaves are running at 95 percent capacity and one of them fails, each remaining server would have to run at roughly 106 percent of its capacity to carry the same load, which is clearly not possible. In that case, the read capacity of the system is reduced and response times grow longer.
And planning for the loss of one server is not sufficient: you have
to consider the probability of losing more than one server and prepare for
that situation as well. Continuing with our previous example, even if each
server is running at 80 percent capacity, the system will be able to
handle the loss of one server. However, the loss of two servers means that
the load on each remaining server will increase to 100 percent, leaving
you with no room for unexpected bursts in traffic. If this occurs once a
year, it might be manageable, but you have to know how often it is likely
to happen.
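If you want to check these figures for your own deployment, the arithmetic is simple enough to put in a few lines of code. Here is a minimal sketch in Python, using the numbers from the example above; it just spreads the unchanged total load over the slaves that remain:

```python
def load_after_failures(num_slaves, load_per_slave, failures):
    """Per-slave load after `failures` slaves have been lost.

    The total load stays the same, so it is spread over the
    slaves that remain.
    """
    return num_slaves * load_per_slave / (num_slaves - failures)

# 10 slaves at 50 percent: losing one raises the load to about 56 percent.
print(load_after_failures(10, 0.50, 1))   # 0.5555...

# At 95 percent, the remaining slaves would need roughly 106 percent,
# which is not possible, so response times suffer instead.
print(load_after_failures(10, 0.95, 1))   # 1.0555...

# At 80 percent, one lost server is manageable (about 89 percent),
# but losing two pushes the remaining servers to exactly 100 percent.
print(load_after_failures(10, 0.80, 1))   # 0.8888...
print(load_after_failures(10, 0.80, 2))   # 1.0
```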
Table 1 gives example probabilities of losing at least 1, 2, or 3 servers in a setup of 100 servers, given different probabilities of losing a single server. As you can see, with a 1 percent probability of losing any single server, you run a 16 percent risk of losing three or more servers. If you are not prepared to handle that, you are in for problems if it actually happens.
Note:
For a stochastic variable X representing the number of servers lost out of the 100, the probabilities are calculated using the binomial tail distribution:

P(X ≥ n) = Σ_{i=n}^{100} C(100, i) p^i (1 − p)^(100−i)

where p is the probability of losing a single server.
Table 1. Probabilities of losing servers (out of 100)

| Probability of losing a single server | At least 1 lost | At least 2 lost | At least 3 lost |
| --- | --- | --- | --- |
| 1.00% | 100.00% | 49.50% | 16.17% |
| 0.50% | 50.00% | 12.38% | 2.02% |
| 0.10% | 10.00% | 0.50% | 0.02% |
To avoid such a situation, you have to monitor the deployment
closely to know what the load is, figure out the capacity of your system
through measurements, and do your math to see where the response times
will start to suffer.
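Doing that math by hand quickly becomes tedious, so a small script helps. The sketch below computes the binomial tail probability from the note above; the server counts and failure probability in the example are made-up inputs, so plug in your own measurements:

```python
from math import comb

def prob_at_least(n_failed, total_servers, p):
    """Probability of losing at least n_failed out of total_servers
    servers when each server fails independently with probability p
    (the binomial tail distribution)."""
    return sum(
        comb(total_servers, i) * p ** i * (1 - p) ** (total_servers - i)
        for i in range(n_failed, total_servers + 1)
    )

# Example: 10 slaves with spare capacity sized to absorb the loss of one
# server. How likely is it that two or more fail at the same time?
print(prob_at_least(2, 10, 0.01))   # about 0.004, i.e., roughly 0.4 percent
```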
2. Planning
Having redundancy is not sufficient; you also need to have plans
for what to do when the components fail. In the previous example, it is
easy to handle a failing slave, since new connections will be redirected
to the working slaves, but consider the following:
What happens with the existing connections? Just aborting and
returning an error message to the user is probably not a good idea.
Typically, there is an application layer between the user and the
database, so in this case the application layer has to retry the query
with another server.
What happens if the master fails? In the previous example, only
the slaves failed, but the master can also fail. Assuming you have
added redundancy by keeping an extra master around, you must also have plans for moving
all the slaves over to the new master.
2.1. Slave Failures
By far, the easiest failures to handle are slave failures. Since the slaves are only
used for read queries, it is sufficient to inform the load balancer that
the slave is missing, which will direct new queries to the functioning
slaves. There have to be enough slaves left to handle the load at the reduced capacity, but apart from that, a failing slave does not normally affect the replication topology, and there are no special topologies you need to consider to make slave failures easier to manage.
When a slave fails, there are inevitably some queries that were sent to it and are still waiting for a reply. Once those connections report an error caused by the lost server, the queries have to be repeated on a functioning slave.
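What the retry looks like depends entirely on the application layer, but the idea is straightforward. The sketch below is only illustrative: the slave host names, user, password, and database are placeholders, the list of slaves would normally come from the load balancer or the configuration, and MySQL Connector/Python is assumed as the client library.

```python
import mysql.connector
from mysql.connector import Error

# Placeholder list of read slaves; in practice this would come from
# the load balancer or the deployment's configuration.
READ_SLAVES = ["slave1.example.com", "slave2.example.com", "slave3.example.com"]

def run_read_query(query, params=()):
    """Run a read query, repeating it on another slave if one has failed."""
    last_error = None
    for host in READ_SLAVES:
        try:
            conn = mysql.connector.connect(host=host, user="reader",
                                           password="secret", database="app")
            try:
                cursor = conn.cursor()
                cursor.execute(query, params)
                return cursor.fetchall()
            finally:
                conn.close()
        except Error as exc:
            # The slave is lost or failing: remember the error and
            # repeat the query on the next functioning slave.
            last_error = exc
    raise last_error  # every slave failed, so report the error upward
```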
2.2. Master Failures
If the master fails, it has to be replaced to keep the deployment up,
and it has to be replaced quickly. The moment the master fails, all
write queries will be aborted, so the first thing to do is to get a new
master available and direct all clients to it.
Since the master has failed, all the slaves are now without a master, meaning that their data will become increasingly stale, but they are still up and can reply to read queries.
However, some queries may block while waiting for changes to arrive at the slave. Changes that made it into the relay log of the slave before the crash will eventually be executed, so no special consideration has to be taken on behalf of the queries waiting for them.
For queries that are waiting for events that did not leave the
master before it crashed, the situation is bleaker. In this case, it is
necessary to ensure they are handled. This usually means they are
reported as failures, so the user will have to reissue the query.
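As noted earlier, the slaves also have to be moved over to the new master, and that plan can be as simple as a script that points every remaining slave at the promoted master. The sketch below is only illustrative: it assumes the classic CHANGE MASTER TO replication syntax and MySQL Connector/Python, the host names, credentials, and binary log coordinates are placeholders, and the hard part, determining the correct binary log position on the new master, is not shown.

```python
import mysql.connector

# Placeholders: the remaining slaves and the promoted master's
# binary log coordinates (finding the right coordinates is the hard part).
SLAVES = ["slave1.example.com", "slave2.example.com"]
NEW_MASTER_HOST = "master2.example.com"
NEW_MASTER_LOG_FILE = "master-bin.000042"
NEW_MASTER_LOG_POS = 12345

def redirect_slave(slave_host):
    """Point one slave at the promoted master."""
    conn = mysql.connector.connect(host=slave_host, user="repl_admin",
                                   password="secret")
    cursor = conn.cursor()
    cursor.execute("STOP SLAVE")
    cursor.execute(
        "CHANGE MASTER TO MASTER_HOST='{0}', "
        "MASTER_LOG_FILE='{1}', MASTER_LOG_POS={2}".format(
            NEW_MASTER_HOST, NEW_MASTER_LOG_FILE, NEW_MASTER_LOG_POS))
    cursor.execute("START SLAVE")
    conn.close()

for slave in SLAVES:
    redirect_slave(slave)
```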
2.3. Relay Failures
Servers acting as relays require special handling. If a relay fails, the slaves connected to it have to be redirected to use some other relay or the master itself. Since the relay was added to relieve the master of some of the load, the master is unlikely to be able to take on the load of the whole batch of slaves that were connected to the failed relay.
2.4. Disaster Recovery
In the world of high availability, “disaster” does not have to mean earthquakes or floods; it just means that something went very wrong and the problem is not local to a single machine.
Typical examples are losing power in the data center, not necessarily because power was lost in the whole city; losing power in just that building is enough.
The nature of a disaster is that many things fail at once, making
it impossible to handle redundancy by duplicating servers at a single
data center. Instead, it is necessary to ensure data is kept safe at
another geographic location, and it is quite common for companies to
have different components at different offices, even when the company is
relatively small.