My SQL : Replication for High Availability - Procedures (part 1) - The Outage

11/24/2013 6:30:58 PM

After you have eliminated all single points of failure, ensured you have sufficient redundancy for the system, and made plans for every contingency, you should be ready for the last step.

All your resources and careful planning are of no use unless you can wield them properly. You can usually manage a small site with a few servers manually with very little planning, but as the number of servers increases, automation becomes a necessity—and if you run a successful business, the number of servers might have to increase quickly.

You’re likely better off if you plan from day one to have automation—if you have to grow, you will be busy handling other matters and will probably not have time to create the necessary automation support.

Some of the basic procedures have already been discussed, but you need to consider having ready-made procedures for at least the following tasks:

Adding new slaves

Creating new slaves when you need to scale is the basis for running a big site. There are several options for creating new slaves. They all circle around methods for taking a snapshot of an existing server, usually a slave, restoring the snapshot on a new server, and then starting replication from the correct position.

The time for taking a snapshot will, of course, affect how quickly you can bring the new slave up; if the backup time is too long, the master may have issued a lot of changes, which means that the new slave will take longer to catch up. For this reason, the snapshot time is important. Figure 1 shows the snapshot time when the slave has caught up. You can see that when the slave is stopped to take a snapshot, the changes will start to accumulate, which will cause the outstanding changes to increase. Once the slave is restarted, it will start to apply the outstanding changes and the number of outstanding changes will decrease.

Some different methods of taking a snapshot include the following:

Using mysqldump

Using mysqldump is safe but slow. It allows you to restore the data using a different storage engine than that on the original. If you use InnoDB tables, you can take a consistent snapshot, meaning you do not have to bring the server offline.

Copying the database files

This is relatively fast, but requires you to bring the server offline before copying the files.

Using an online backup method

There are different methods available, such as the InnoDB Hot Backup.

Using LVM to get a snapshot

On Linux, it is possible to take a snapshot of a volume using Logical Volume Manager (LVM). It does require that you prepare beforehand, since a special LVM volume has to be created.

Using filesystem snapshot methods

The Solaris ZFS, for example, has built-in support for taking snapshots. This is a very fast technique for creating backups, but it is similar to the other techniques above (except for mysqldump). This means that you cannot restore into a different engine than the one you used to take the server snapshot.

Figure 1. Outstanding changes when taking a snapshot

Removing slaves from the topology

Removing slaves from the setup only requires notifying the load balancer that the slave is absent.

Switching the master

For routine maintenance, it is common to have to switch all the slaves of a master over to a secondary master as well as notify load balancers of the master’s absence. This procedure can and should be handled with no downtime at all, since it should not affect normal operations.

Using slave promotion is one way to handle this, but it might be easier to use a hot standby instead .

Handling slave failures

Your slaves will fail—it is just a matter of how often. Handling slave failures must be a routine event in any deployment.

Handling master failures

When the master goes down suddenly, you have to detect the failure and move all the slaves over to a standby, or promote one of the slaves to be the new master.

Upgrading slaves

Upgrading slaves to new versions of the server should usually not be a problem. However, bringing the slave out of the system for the upgrade requires removing it from the load balancer and maybe notifying other systems of the slave’s absence.

Upgrading masters

To upgrade the master, it is usually necessary to upgrade all the slaves first. However, this may not always be the case. To upgrade the master, it is usually necessary to either use a standby as a master while you are performing the upgrade or promote one of the slaves to be the master for the duration of the upgrade.

The Outage

The MySQL IT team is a versatile and very dedicated group of people, able to handle all kinds of systems and equipment. Unlike many other IT teams I have met over the years, these guys are comfortable handling the complex array of computers that MySQL has accumulated over the years—everything from high-end Windows machines to very old SGI Irix and HP-UX machines—keeping it all together with whatever means they have at their disposal.

The data center has been invaluable for testing the MySQL server on many different kinds of machines, but as MySQL began growing, it started to get quite cramped in there. So, the IT team prepared a new data center in better—and more expensive—facilities in Stockholm. The move was planned for the weekend, but events took a nasty turn in the week just before the planned move.

I usually work from home, but this particular day I needed to go to the office in Uppsala for some project meetings. I noticed that the site was down that morning, but left for the office anyway, hoping for the best.

Arriving at the office in Uppsala, I met part of the MySQL IT team stringing power cords together, all going into the data center. Apparently, the entire building had lost all power, but the nearby buildings still had power, so the IT team had strung a long line of power cords together to get power to some critical development servers.

The power outage was quite severe and when I arrived there, the UPS had been depleted. The power grid to the building had been damaged and since the engineers from the electricity company could not say when the problem would be fixed, part of the IT team had decided to relocate the web stack immediately, basically taking whatever cars they could find and driving the machines with the web stack to the new data center in Stockholm, approximately 100 kilometers south.

After getting the web stack online in Stockholm, the site was restored, but there was a lot of work to be done to restore the development machines using whatever power was available until the power grid was repaired. The group worked for 48 hours with almost no sleep, after which the machines were fully restored and operational—more than could be said about the IT team.

MySQL was back on track again….

  •  My SQL : Replication for High Availability - Redundancy, Planning
  •  SQL Server 2012 : Exploring SQL CLR - Deployment (part 2) - Deploying Your Stored Procedures, Testing Your Stored Procedures
  •  SQL Server 2012 : Exploring SQL CLR - Deployment (part 1) - Deploying Your Assembly
  •  SQL Server 2012 : Exploring SQL CLR - Your First SQL CLR Stored Procedure, CLR Stored Procedures and Server-Side Data Access
  •  SQL Server 2012 : Exploring SQL CLR - Visual Studio/SQL Server Integration
  •  SQL Server 2008 : DBCC validation - Removing corruption
  •  SQL Server 2008 : Mirroring in action (part 3) - Suspending and resuming mirroring, Initiating failover
  •  SQL Server 2008 : Mirroring in action (part 2) - Monitoring database mirroring
  •  SQL Server 2008 : Mirroring in action (part 1) - Mirroring setup
  •  SQL Server 2008 : High availability with database mirroring - Failover options
    Top 10
    SG50 Ferrari F12berlinetta : Prancing Horse for Lion City's 50th
    The latest Audi TT : New angles for TT
    Era of million-dollar luxury cars
    Game Review : Hearthstone - Blackrock Mountain
    Game Review : Battlefield Hardline
    Google Chromecast
    Keyboards for Apple iPad Air 2 (part 3) - Logitech Ultrathin Keyboard Cover for iPad Air 2
    Keyboards for Apple iPad Air 2 (part 2) - Zagg Slim Book for iPad Air 2
    Keyboards for Apple iPad Air 2 (part 1) - Belkin Qode Ultimate Pro Keyboard Case for iPad Air 2
    Michael Kors Designs Stylish Tech Products for Women
    - First look: Apple Watch

    - 3 Tips for Maintaining Your Cell Phone Battery (part 1)

    - 3 Tips for Maintaining Your Cell Phone Battery (part 2)
    Popular Tags
    Video Tutorail Microsoft Access Microsoft Excel Microsoft OneNote Microsoft PowerPoint Microsoft Project Microsoft Visio Microsoft Word Active Directory Exchange Server Sharepoint Sql Server Windows Server 2008 Windows Server 2012 Windows 7 Windows 8 Adobe Flash Professional Dreamweaver Adobe Illustrator Adobe Photoshop CorelDRAW X5 CorelDraw 10 windows Phone 7 windows Phone 8 Iphone