1. Hot Standby
The easiest of the topologies for duplicating servers is the hot standby
topology. This topology is shown in Figure 2 and consists of the master and a
dedicated server called a hot standby that duplicates the main master.
The hot standby server is connected to the master as a slave, and it
reads and applies all changes.
The idea is that when the main master fails, the hot standby
provides a faithful replica of the master, and all the clients and
slaves can therefore be switched over to the hot standby and continue
operating. As with many ideas, the reality is not always that
rosy.
Failure is inevitable, at least when you run a large deployment.
It is not a question of if servers fail, but
when and how often they fail.
If the master fails for any reason, it should not bring the deployment
to a halt. To ensure operations proceed, it is necessary to have a hot
standby server available and to redirect all slaves to the hot standby
when the main master fails. This will give you a chance to check what
happened with the main master, and maybe fix it or replace it. After you
have repaired the master, you have to bring it back on track and either
set it to be the hot standby, or redirect the slaves to the original
master again.
Sounds simple, doesn’t it? If only it were that easy. Unfortunately,
you have the following potential problems to
ponder:
When failing over to the hot standby, you are replicating from
a new master, so it will be necessary to translate the binlog
positions from those of the original master to those of the hot
standby.
When failing over a slave to a hot standby, the hot standby
might actually not have all the changes that the slave has.
When bringing the repaired master back into the configuration,
the repaired master might have changes in the binary log that never
left the server.
All these are relevant issues, but for starters, let’s just
consider the simpler case illustrated in Figure 3: that of performing
a switchover from a running master to
a standby in order to, for example, perform maintenance on the original
master. In this case, the master is still running, so the situation
becomes a lot simpler, since we can control the master and make it work
for us instead of against us. We will later consider how to handle the
case when the master just goes down because its software crashed, a
frustrated coworker decided to kick the server, or the janitor tripped
over the power cord.
By default, events executed by the slave thread are
not logged to the binary log, which turns out to be
a problem if the slave is a standby waiting to be used as a master. In
this case, it is necessary to have all the changes sent by the master to
the standby written to the binary log of the standby—if not, there will
be nothing to replicate. To configure the standby server for this, add
the option log-slave-updates to
the my.cnf file. This option
ensures that statements received from the master and executed are also
written to the slave’s binary log.
[mysqld]
user = mysql
pid-file = /var/run/mysqld/mysqld.pid
socket = /var/run/mysqld/mysqld.sock
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
log-bin = standby-bin
log-bin-index = standby-bin.index
# The server ID must differ from that of the master and of every other slave
server-id = 2
log-slave-updates
After updating the options file, restart the server.
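Before restarting, it can be worth confirming that the option really ended up in the right section of the options file. The following is only a minimal sketch, assuming the file is simple enough for Python's standard configparser module to read (it does not follow !include or !includedir directives); the function name has_log_slave_updates is made up for this illustration.

```python
import configparser

def has_log_slave_updates(cnf_text):
    """Return True if the [mysqld] section enables log-slave-updates."""
    # allow_no_value accepts bare options such as "log-slave-updates";
    # interpolation=None keeps '%' characters in values from being interpreted.
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None)
    parser.read_string(cnf_text)
    if not parser.has_section("mysqld"):
        return False
    for name, value in parser.items("mysqld"):
        # MySQL accepts both dashes and underscores in option names.
        if name.replace("_", "-") == "log-slave-updates":
            # A bare option means "on"; an explicit 0/OFF/FALSE turns it off.
            return value is None or value.strip().lower() not in ("0", "off", "false")
    return False

example = """
[mysqld]
server-id = 2
log-bin = standby-bin
log-slave-updates
"""
print(has_log_slave_updates(example))
```

The check only inspects the text of the file. Whether the running server has picked up the option still has to be verified on the server itself after the restart.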
The main difficulty in switching over to a standby is making sure
that the slave starts replicating from the standby at precisely the
point where it stopped replicating from the master. If the
positions were easy to translate (for example, if they were the
same on both the master and the standby), there would be no problem.
Unfortunately, the positions may differ between the master and the
standby for a number of reasons. The most common is that the
standby was not attached to the master when the master was started, but
even if it was, events are not guaranteed to be written to the
standby’s binary log in the same way as they were written to the
master’s.
The basic idea for performing the switchover is to stop the slave
and the standby at exactly the same position and then just redirect the
slave to the standby. Because the standby hasn’t made any changes after
the position where you stopped it, you can just check the binlog
position on the standby and direct the slave to start at that position.
However, just stopping the slave and the standby will not guarantee that
they are synchronized, so you have to do this manually.
To do this, stop both the slave and the standby and compare the
binlog positions. Since both positions refer to positions on the same
master—the slave and standby are both connected to the same master—you
can check the positions just by comparing the filename and the byte
position lexicographically (in that order).
standby> SHOW SLAVE STATUS\G
...
Relay_Master_Log_File: master-bin.000096
...
Exec_Master_Log_Pos: 756648
1 row in set (0.00 sec)
slave> SHOW SLAVE STATUS\G
...
Relay_Master_Log_File: master-bin.000096
...
Exec_Master_Log_Pos: 743456
1 row in set (0.00 sec)
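The comparison described above can be sketched in Python by treating each position as a (file, offset) pair, since tuples compare element by element. This is an illustrative sketch only: the Position type is hypothetical, and comparing file names as strings assumes both names share the same base name and a zero-padded numeric suffix of equal width.

```python
from typing import NamedTuple

class Position(NamedTuple):
    file: str  # value of Relay_Master_Log_File
    pos: int   # value of Exec_Master_Log_Pos, kept as an integer

standby_pos = Position("master-bin.000096", 756648)
slave_pos = Position("master-bin.000096", 743456)

# Tuples are ordered by the first field, then the second: file name, then offset.
if slave_pos < standby_pos:
    behind = "slave"
elif slave_pos > standby_pos:
    behind = "standby"
else:
    behind = None  # both stopped at the same position

print(behind)  # prints: slave
```

The offset has to be compared as a number. As strings, "99" would sort after "756648", which would give the wrong answer.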
In this case, the standby is ahead of the slave, so note the
position at which the standby stopped and start the slave so that it
runs until it has caught up with the standby. To have the slave catch
up with the standby and stop at the right position:
slave> START SLAVE UNTIL
-> MASTER_LOG_FILE = 'master-bin.000096',
-> MASTER_LOG_POS = 756648;
Query OK, 0 rows affected (0.18 sec)
slave> SELECT MASTER_POS_WAIT('master-bin.000096', 756648);
1 row in set (1.12 sec)
The slave and standby have now stopped at exactly the same
position, and everything is ready to do the switchover to the standby
using CHANGE MASTER TO to
direct the slave to the standby and start it. But what position should
you specify? Since the file and position that the master recorded for
its stopping point are different from the file and position recorded by
the standby for the same point, it is necessary to fetch the position
that the standby recorded while recording the changes as a master. To do
this, execute SHOW MASTER STATUS on
the standby:
standby> SHOW MASTER STATUS\G
*************************** 1. row ***************************
File: standby-bin.000019
Position: 56447
Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)
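If the position is being collected by a script rather than read off the screen, the vertical (\G) output can be turned into a dictionary of fields. The helper below is a hypothetical sketch, not part of any MySQL tooling, and it assumes the output contains a single row.

```python
def parse_vertical(output):
    """Parse single-row \\G output from the mysql client into a dict."""
    fields = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip the row separator, blank lines, and the "1 row in set" footer.
        if not line or line.startswith("*") or ":" not in line:
            continue
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields

status = parse_vertical("""
*************************** 1. row ***************************
            File: standby-bin.000019
        Position: 56447
    Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)
""")
print(status["File"], int(status["Position"]))
```

In a real script it is usually simpler to run the statement through a database connector and read the columns directly; parsing client output is shown here only because it matches the console session above.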
Now you can redirect the slave to the standby using the correct
position:
slave> CHANGE MASTER TO
-> MASTER_HOST = 'standby-1',
-> MASTER_PORT = 3306,
-> MASTER_USER = 'repl_user',
-> MASTER_PASSWORD = 'xyzzy',
-> MASTER_LOG_FILE = 'standby-bin.000019',
-> MASTER_LOG_POS = 56447;
Query OK, 0 rows affected (0.18 sec)
slave> START SLAVE;
Query OK, 0 rows affected (0.25 sec)
If the opposite is true—that the slave is ahead of the standby—you
can just switch the roles of the standby and the slave in the previous
steps. This is possible since the master is running and can provide
either the slave or the standby with the missing changes. In the next
section, we will consider how to handle the situation in which the
master has stopped unexpectedly and hence cannot provide either the
slave or the standby with the missing changes.
1.1. Handling a switchover in Python
Example 1 shows
the Python code for switching a slave over to another
master. The replicate_to_position
function instructs a server to read from the master only to the given
position. When the procedure returns, the slave will have stopped at
exactly this position. The switch_to_master function directs a slave to a new
master. The procedure assumes that both the server on which it
executes and the new master are connected to the same original master.
If they are not, the positions are not comparable and the procedure
will raise an exception.
Example 1. Procedure for switching to a new master
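def replicate_to_position(server, pos):
    # Run the slave until it reaches the given position on the master.
    server.sql("START SLAVE UNTIL MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
               (pos.file, pos.pos))
    # Block until the slave has applied everything up to that position.
    server.sql("SELECT MASTER_POS_WAIT(%s,%s)", (pos.file, pos.pos))

def switch_to_master(server, standby):
    stop_slave(server)
    stop_slave(standby)
    server_pos = fetch_slave_position(server)
    standby_pos = fetch_slave_position(standby)
    # Bring whichever server is behind up to the position of the other.
    if server_pos < standby_pos:
        replicate_to_position(server, standby_pos)
    elif server_pos > standby_pos:
        replicate_to_position(standby, server_pos)
    # Both have now stopped at the same point, so redirect the slave to the
    # corresponding position in the standby's own binary log.
    master_pos = fetch_master_position(standby)
    change_master(server, standby, master_pos)
    start_slave(standby)
    start_slave(server)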
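Example 1 relies on several helpers that are not shown in this section. The following is one possible sketch of them, not the definitions the example was written against. It assumes that server.sql returns the single status row as a mapping from column name to value, and that a server object carries the hypothetical attributes host, port, repl_user, and repl_password.

```python
from collections import namedtuple

# Hypothetical position type: orders by file name first, then byte offset.
Position = namedtuple("Position", ["file", "pos"])

def stop_slave(server):
    server.sql("STOP SLAVE")

def start_slave(server):
    server.sql("START SLAVE")

def fetch_slave_position(server):
    # Position on the master up to which this server has applied changes.
    row = server.sql("SHOW SLAVE STATUS")
    return Position(row["Relay_Master_Log_File"],
                    int(row["Exec_Master_Log_Pos"]))

def fetch_master_position(server):
    # Current end of this server's own binary log.
    row = server.sql("SHOW MASTER STATUS")
    return Position(row["File"], int(row["Position"]))

def change_master(slave, master, pos):
    # host, port, repl_user and repl_password are assumed attribute names.
    slave.sql(
        "CHANGE MASTER TO MASTER_HOST=%s, MASTER_PORT=%s, "
        "MASTER_USER=%s, MASTER_PASSWORD=%s, "
        "MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
        (master.host, master.port, master.repl_user, master.repl_password,
         pos.file, pos.pos))
```

Because Position is a tuple, the comparisons server_pos < standby_pos and server_pos > standby_pos in switch_to_master work without further code. A plain tuple does not detect positions that come from different masters, so that check would have to be added separately.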