My SQL : Replication for High Availability - Redundancy, Planning

11/24/2013 6:24:39 PM

1. Redundancy

To understand where redundancy might be needed, you have to identify every potential point of failure in the deployment. Even though it sounds easy—not to mention a tad tedious and boring—it requires some imagination to ensure that you really have found them all. Switches, routers, network cards, and even network cables are single points of failure. Outside of your architecture, but no less important, are power sources and physical facilities. But what about services needed to keep the deployment up? Suppose all network management is consolidated in a web-based interface? Or what if you have only one staff person who knows how to handle some types of failure?

Identifying the points of failure does not necessarily mean that you have to eliminate them all. Sometimes it is just not possible for economical, technical, or geographic reasons, but being aware of them helps you with planning.

Some things that you should consider, or at least make a conscious decision about whether to consider, are cost of duplicating components, the probability of failure for different components, the time to replace a component, and risk exposure while repairing a component. If repairing a component takes a week and you are running with the spare as the single point of failure during this time, you are taking a certain risk that the spare could be lost as well, which may or may not be acceptable.

Once you have identified where you need redundancy, you have to choose between two fundamental alternatives: you can either keep duplicates around for each component—ready to take over immediately if the original component should fail—or you can ensure you have extra capacity in the system so that if a component fails, you can still handle the load. This choice does not have to be made in an all-or-nothing fashion: you can combine the two techniques so that you duplicate some components and use extra capacity for some other parts of the system.

On the surface, the easiest approach is to duplicate components, but duplication is expensive. You have to leave a standby around and keep it up-to-date with the main component all the time. The advantages of duplicating components are that you do not lose performance when switching and that switching to the standby is usually faster than restructuring the system, which you would have to do if you approached the problem by creating spare capacity.

Creating spare capacity lets you use all the components for running the business, possibly allowing you to handle higher peaks in your load. When a component breaks, you restructure the system so that all remaining components are in use. It is, however, important to have more capacity than you normally need.

To understand why, consider a simple case where you have a master that handles the writes—actually, you should have two, since you need to have redundancy—with a set of slaves connected to the master whose only purpose is to serve read requests.

Should one of the slaves fail, the system will still be responding, but the capacity of the system will be reduced. If you have 10 slaves, each running at 50 percent capacity, the failure of one slave will increase the load on each slave to 55 percent, which is easy to handle. However, if the slaves are running at 95 percent capacity and one of the slaves fails, each server would have to handle 105 percent of the original load to handle the same load, which is clearly not possible. In this case, the read capacity of the system will be reduced and the response time will be longer.

And planning for the loss of one server is not sufficient: you have to consider the probability of losing more than one server and prepare for that situation as well. Continuing with our previous example, even if each server is running at 80 percent capacity, the system will be able to handle the loss of one server. However, the loss of two servers means that the load on each remaining server will increase to 100 percent, leaving you with no room for unexpected bursts in traffic. If this occurs once a year, it might be manageable, but you have to know how often it is likely to happen.

Table 1 gives example probabilities for losing 1, 2, or 3 servers in a setup of 100 servers, given different probabilities of losing a single server. As you can see, with a 1 percent probability of losing a server, you have a 16 percent risk of losing three or more servers. If you are not prepared to handle that, you’re in for some problems if it actually happens.


For a stochastic variable X representing the number of servers lost, the probabilities are calculated using the binomial tail distribution:

Table 1. Probabilities of losing servers
Probability of losing a single server123

To avoid such a situation, you have to monitor the deployment closely to know what the load is, figure out the capacity of your system through measurements, and do your math to see where the response times will start to suffer.

2. Planning

Having redundancy is not sufficient; you also need to have plans for what to do when the components fail. In the previous example, it is easy to handle a failing slave, since new connections will be redirected to the working slaves, but consider the following:

  • What happens with the existing connections? Just aborting and returning an error message to the user is probably not a good idea. Typically, there is an application layer between the user and the database, so in this case the application layer has to retry the query with another server.

  • What happens if the master fails? In the previous example, only the slaves failed, but the master can also fail. Assuming you have added redundancy by keeping an extra master around, you must also have plans for moving all the slaves over to the new master.

2.1. Slave Failures

By far, the easiest failures to handle are slave failures. Since the slaves are only used for read queries, it is sufficient to inform the load balancer that the slave is missing, which will direct new queries to the functioning slaves. There have to be enough slaves to handle the reduced capacity of the system, but apart from that, a failing slave does not normally affect the replication topology and there are no specific topologies that you need to consider to make slave failure easier to manage.

When a slave has failed, there are inevitably some queries that have been sent to the slave that are waiting for a reply. Once these connections report an error resulting from a lost server, the queries have to be repeated with a functioning slave.

2.2. Master Failures

If the master fails, it has to be replaced to keep the deployment up, and it has to be replaced quickly. The moment the master fails, all write queries will be aborted, so the first thing to do is to get a new master available and direct all clients to it.

Since the main master failed, all the slaves are now without a master as well, meaning that all the slaves have stale data, but they are still up and can reply to read queries.

However, some queries may block if they are waiting for changes to arrive at the slave. Some queries may make it into the relay log of the slave and therefore will eventually be executed by the slave. No special consideration has to be taken on the behalf of these queries.

For queries that are waiting for events that did not leave the master before it crashed, the situation is bleaker. In this case, it is necessary to ensure they are handled. This usually means they are reported as failures, so the user will have to reissue the query.

2.3. Relay Failures

For servers acting as relay servers, the situation has to be handled specially. If they fail, the remaining slaves have to be redirected to use some other relay or the master itself. Since the relay has been added to relieve the master of some load, it is likely that the master will not be able to handle the load of a batch of slaves connected to one of its relays.

2.4. Disaster Recovery

In the world of high availability, “disaster” does not have to mean earthquakes or floods; it just means that something went very bad for the computer and it is not local to the machine that failed.

Typical examples are lost power in the data center—not necessarily because the power was lost in the city; just losing power in the building is sufficient.

The nature of a disaster is that many things fail at once, making it impossible to handle redundancy by duplicating servers at a single data center. Instead, it is necessary to ensure data is kept safe at another geographic location, and it is quite common for companies to have different components at different offices, even when the company is relatively small.

  •  SQL Server 2012 : Exploring SQL CLR - Deployment (part 2) - Deploying Your Stored Procedures, Testing Your Stored Procedures
  •  SQL Server 2012 : Exploring SQL CLR - Deployment (part 1) - Deploying Your Assembly
  •  SQL Server 2012 : Exploring SQL CLR - Your First SQL CLR Stored Procedure, CLR Stored Procedures and Server-Side Data Access
  •  SQL Server 2012 : Exploring SQL CLR - Visual Studio/SQL Server Integration
  •  SQL Server 2008 : DBCC validation - Removing corruption
  •  SQL Server 2008 : Mirroring in action (part 3) - Suspending and resuming mirroring, Initiating failover
  •  SQL Server 2008 : Mirroring in action (part 2) - Monitoring database mirroring
  •  SQL Server 2008 : Mirroring in action (part 1) - Mirroring setup
  •  SQL Server 2008 : High availability with database mirroring - Failover options
  •  Programming ASP.NET 3.5 : Data Source-Based Data Binding (part 4) - Iterative Controls
    Top 10
    Review : Sigma 24mm f/1.4 DG HSM Art
    Review : Canon EF11-24mm f/4L USM
    Review : Creative Sound Blaster Roar 2
    Review : Philips Fidelio M2L
    Review : Alienware 17 - Dell's Alienware laptops
    Review Smartwatch : Wellograph
    Review : Xiaomi Redmi 2
    Extending LINQ to Objects : Writing a Single Element Operator (part 2) - Building the RandomElement Operator
    Extending LINQ to Objects : Writing a Single Element Operator (part 1) - Building Our Own Last Operator
    3 Tips for Maintaining Your Cell Phone Battery (part 2) - Discharge Smart, Use Smart
    - First look: Apple Watch

    - 3 Tips for Maintaining Your Cell Phone Battery (part 1)

    - 3 Tips for Maintaining Your Cell Phone Battery (part 2)
    - How to create your first Swimlane Diagram or Cross-Functional Flowchart Diagram by using Microsoft Visio 2010 (Part 1)

    - How to create your first Swimlane Diagram or Cross-Functional Flowchart Diagram by using Microsoft Visio 2010 (Part 2)

    - How to create your first Swimlane Diagram or Cross-Functional Flowchart Diagram by using Microsoft Visio 2010 (Part 3)
    Popular Tags
    Video Tutorail Microsoft Access Microsoft Excel Microsoft OneNote Microsoft PowerPoint Microsoft Project Microsoft Visio Microsoft Word Active Directory Exchange Server Sharepoint Sql Server Windows Server 2008 Windows Server 2012 Windows 7 Windows 8 Adobe Flash Professional Dreamweaver Adobe Illustrator Adobe Photoshop CorelDRAW X5 CorelDraw 10 windows Phone 7 windows Phone 8 Iphone