After you have eliminated all single points of failure, ensured you have
sufficient redundancy for the system, and made plans for every
contingency, you should be ready for the last step.
All your resources and careful planning are of no use unless you can
wield them properly. You can usually manage a small site with a few
servers manually with very little planning, but as the number of servers
increases, automation becomes a necessity—and if you run a successful
business, the number of servers might have to increase quickly.
You are better off planning for automation from day one: if the site
does grow, you will be busy handling other matters and will probably
not have time to create the necessary automation support.
Some of the basic procedures have already been discussed, but you
need to consider having ready-made procedures for at least the following tasks:
Adding new slaves
Creating new slaves when you need to scale is the basis for running
a big site. There are several options for creating new slaves, but
they all revolve around the same steps: take a snapshot of an existing
server (usually a slave), restore the snapshot on a new server, and
then start replication from the correct position. Two of the snapshot
methods listed below are sketched in more detail after the list.
The time it takes to create the snapshot will, of course, affect how
quickly you can bring the new slave up; if the snapshot takes too
long, the master will have issued a lot of changes in the meantime,
and the new slave will take longer to catch up. For this reason, it is
important to keep the snapshot time short. Figure 1 illustrates how
the snapshot time affects the time it takes the slave to catch up:
while the slave is stopped to take the snapshot, changes accumulate
and the number of outstanding changes grows; once the slave is
restarted, it starts applying the outstanding changes and the backlog
shrinks again.
Some different methods of taking a snapshot include the
following:
Using mysqldump
Using mysqldump is
safe but slow. It allows you to restore the data using a
different storage engine than that on the original. If you
use InnoDB tables, you can take a consistent
snapshot, meaning you do not have to bring the server
offline.
Copying the database files
This is relatively fast, but requires you to bring the
server offline before copying the files.
Using an online backup method
Several online backup tools are available, such as InnoDB Hot
Backup.
Using LVM to get a snapshot
On Linux, it is possible to take a snapshot of a volume
using Logical Volume Manager (LVM). It does require that you
prepare beforehand, since a special LVM volume has to be
created.
Using filesystem snapshot methods
Solaris ZFS, for example, has built-in support for taking snapshots.
This is a very fast way to create a backup, but like the other
techniques above (except for mysqldump), it works at the level of the
database files. This means that you cannot restore the snapshot into a
different storage engine than the one used when the snapshot was
taken.
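
To make the mysqldump approach concrete, here is a minimal sketch of
bringing up a new slave. The host names, accounts, and passwords are
placeholders, and for simplicity the dump is taken from the master
itself rather than from an existing slave; the --master-data option
makes mysqldump record the master's binlog coordinates inside the
dump, so the new slave can start replicating from the correct
position.

    # Take a consistent InnoDB dump of the master and embed the current
    # binlog coordinates in the dump file.
    $ mysqldump --host=master.example.com --user=backup_user --password \
          --single-transaction --master-data=1 --all-databases > snapshot.sql

    -- On the new slave, configure the connection to the master first...
    new-slave> CHANGE MASTER TO
            ->   MASTER_HOST = 'master.example.com',
            ->   MASTER_USER = 'repl_user',
            ->   MASTER_PASSWORD = 'secret';

    # ...then restore the dump. The CHANGE MASTER TO statement written
    # by --master-data sets MASTER_LOG_FILE and MASTER_LOG_POS to the
    # position at which the dump was taken.
    $ mysql --user=root --password < snapshot.sql

    -- Finally, start replicating from that position.
    new-slave> START SLAVE;

Setting the connection parameters before restoring the dump matters:
a later CHANGE MASTER TO that mentions MASTER_HOST would reset the
coordinates recorded by the dump.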
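
The LVM method might be sketched along the following lines, assuming
the donor slave's data directory lives on its own LVM volume (the
volume group, volume, and host names are invented for this example)
and that the new slave's server is not running while the files are
copied in. Replication on the donor is paused only long enough to
record the replication position and create the snapshot.

    -- On the donor slave: pause replication, block writes, and record
    -- how far it has executed in the master's binary log. Keep this
    -- client session open so the read lock is held until UNLOCK TABLES.
    donor> STOP SLAVE;
    donor> FLUSH TABLES WITH READ LOCK;
    donor> SHOW SLAVE STATUS\G
    -- (note the values of Relay_Master_Log_File and Exec_Master_Log_Pos)

    # Create a copy-on-write snapshot of the data volume.
    $ lvcreate --snapshot --size 1G --name mysql_snap /dev/vg0/mysql_data

    -- Let the donor resume replicating.
    donor> UNLOCK TABLES;
    donor> START SLAVE;

    # Mount the snapshot, copy the data directory to the new slave, and
    # remove the snapshot again.
    $ mount /dev/vg0/mysql_snap /mnt/mysql_snap
    $ rsync -a /mnt/mysql_snap/ new-slave.example.com:/var/lib/mysql/
    $ umount /mnt/mysql_snap
    $ lvremove --force /dev/vg0/mysql_snap

    -- On the new slave, after starting the server: point it at the
    -- master using the coordinates recorded above, then start
    -- replication.
    new-slave> CHANGE MASTER TO
            ->   MASTER_HOST = 'master.example.com',
            ->   MASTER_USER = 'repl_user',
            ->   MASTER_PASSWORD = 'secret',
            ->   MASTER_LOG_FILE = 'master-bin.000012',  -- Relay_Master_Log_File
            ->   MASTER_LOG_POS = 3764;                  -- Exec_Master_Log_Pos
    new-slave> START SLAVE;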
Removing slaves from the topology
Removing slaves from the setup only requires notifying the load
balancer that the slave is absent.
Switching the master
For routine maintenance, it is common to have to switch all the
slaves of a master over to a secondary master as well as notify load
balancers of the master’s absence. This procedure can and should be
handled with no downtime at all, since it should not affect normal
operations.
Slave promotion is one way to handle this, but it might be easier to
use a hot standby instead; a sketch of such a switchover appears after
this list.
Handling slave failures
Your slaves will fail—it is just a matter of how often.
Handling slave failures must be a routine event in any deployment.
Handling master failures
When the master goes down suddenly, you have to detect the failure
and either move all the slaves over to a standby or promote one of the
slaves to be the new master (a promotion sketch also follows this
list).
Upgrading slaves
Upgrading slaves to new versions of the server should usually
not be a problem. However, bringing the slave out of the system for
the upgrade requires removing it from the load balancer and maybe
notifying other systems of the slave’s absence.
Upgrading masters
Upgrading the master usually requires upgrading all the slaves first,
although this may not always be the case. While the master is being
upgraded, it is usually necessary to either use a standby as the
master or promote one of the slaves to act as master for the duration
of the upgrade.
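
For the planned switchover described under "Switching the master," the
exact statements depend on how the standby is kept in sync, but in the
simplest case, where updates are stopped and the standby and all
slaves have fully caught up with the master, a sketch looks like the
following (host names and the replication account are placeholders):

    -- 1. Stop updates on the master (note that read_only does not
    --    affect users with the SUPER privilege) and wait until the
    --    standby and all slaves have caught up.
    master> SET GLOBAL read_only = ON;

    -- 2. On the standby, record where new changes will be written.
    standby> SHOW MASTER STATUS;

    -- 3. On each slave, repoint replication at the standby using the
    --    file and position recorded in step 2.
    slave> STOP SLAVE;
    slave> CHANGE MASTER TO
        ->   MASTER_HOST = 'standby.example.com',
        ->   MASTER_USER = 'repl_user',
        ->   MASTER_PASSWORD = 'secret',
        ->   MASTER_LOG_FILE = 'standby-bin.000017',  -- File from step 2
        ->   MASTER_LOG_POS = 998;                    -- Position from step 2
    slave> START SLAVE;

    -- 4. Notify the load balancer to send writes to the standby and to
    --    drop the old master from the rotation.

Switching back after the maintenance is done follows the same steps
with the roles of the two masters reversed.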
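
When the master fails and no up-to-date standby is available (the
"Handling master failures" case above), one common approach is to
promote the most advanced slave and repoint the remaining slaves at
it. The sketch below assumes the simplest case, in which all slaves
have received and executed the same events from the failed master
(lagging slaves must first finish applying their relay logs); host
names and coordinates are again placeholders.

    -- 1. On every slave, check how much of the failed master's binary
    --    log has been received and executed, and pick the most advanced
    --    slave (compare Master_Log_File/Read_Master_Log_Pos) as the new
    --    master.
    slave> SHOW SLAVE STATUS\G

    -- 2. On the chosen slave: stop replication (binary logging must be
    --    enabled on it) and record where new changes will be written.
    new-master> STOP SLAVE;
    new-master> SHOW MASTER STATUS;

    -- 3. On each remaining slave, repoint it at the promoted slave.
    slave> STOP SLAVE;
    slave> CHANGE MASTER TO
        ->   MASTER_HOST = 'new-master.example.com',
        ->   MASTER_USER = 'repl_user',
        ->   MASTER_PASSWORD = 'secret',
        ->   MASTER_LOG_FILE = 'new-master-bin.000003',  -- File from step 2
        ->   MASTER_LOG_POS = 107;                       -- Position from step 2
    slave> START SLAVE;

    -- 4. Update the load balancer so that writes go to the promoted
    --    slave and the failed master is dropped from the rotation.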
The MySQL IT team is a versatile and very dedicated group of people, able to
handle all kinds of systems and equipment. Unlike many other IT teams I
have met over the years, these guys are comfortable handling the complex
array of computers that MySQL has accumulated over the years—everything
from high-end Windows machines to very old SGI Irix and HP-UX
machines—keeping it all together with whatever means they have at their
disposal.
The data center has been invaluable for testing the MySQL server
on many different kinds of machines, but as MySQL began growing, it
started to get quite cramped in there. So, the IT team prepared a new
data center in better—and more expensive—facilities in Stockholm. The
move was planned for the weekend, but events took a nasty turn in the
week just before the planned move.
I usually work from home, but this particular day I needed to go
to the office in Uppsala for some project meetings. I noticed that the
mysql.com site
was down that morning, but left for the office anyway, hoping for the
best.
Arriving at the office in Uppsala, I met part of the MySQL IT team
stringing power cords together, all going into the data center.
Apparently, the entire building had lost all power, but the nearby
buildings still had power, so the IT team had strung a long line of
power cords together to get power to some critical development
servers.
The power outage was quite severe, and by the time I arrived, the
UPS had been depleted. The power grid to the building had been damaged
and since the engineers from the electricity company could not say when
the problem would be fixed, part of the IT team had decided to relocate
the web stack immediately, basically taking whatever cars they could
find and driving the machines with the web stack to the new data center
in Stockholm, approximately 100 kilometers south.
After getting the web stack online in Stockholm, the mysql.com site was
restored, but there was a lot of work to be done to restore the
development machines using whatever power was available until the power
grid was repaired. The group worked for 48 hours with almost no sleep,
after which the machines were fully restored and operational—more than
could be said about the IT team.
MySQL was back on track again….