My SQL : Replication for High Availability - Procedures (part 2) - Hot Standby

11/24/2013 6:33:48 PM

1. Hot Standby

The easiest of the topologies for duplicating servers is the hot standby topology. This topology is shown in Figure 2 and consists of the master and a dedicated server called a hot standby that duplicates the main master. The hot standby server is connected to the master as a slave, and it reads and applies all changes.

Figure 2. Master with a hot standby

The idea is that when the main master fails, the hot standby provides a faithful replica of the master, and all the clients and slaves can therefore be switched over to the hot standby and continue operating. As with many ideas, the reality is not always that rosy.

Failure is inevitable, at least when you run a large deployment. It is not a question of if servers fail, but when and how often they fail. If the master fails for any reason, it should not bring the deployment to a halt. To ensure operations proceed, it is necessary to have a hot standby server available and to redirect all slaves to the hot standby when the main master fails. This will give you a chance to check what happened with the main master, and maybe fix it or replace it. After you have repaired the master, you have to bring it back on track and either set it to be the hot standby, or redirect the slaves to the original master again.

Sounds simple, doesn’t it? Ah, if only it was that easy—unfortunately, you have the following potential problems to ponder:

When failing over to the hot standby, you are replicating from a new master, so it will be necessary to translate the binlog positions from those of the original master to those of the hot standby.
When failing over a slave to a hot standby, the hot standby might actually not have all the changes that the slave has.
When bringing the repaired master back into the configuration, the repaired master might have changes in the binary log that never left the server.

All these are relevant issues, but for starters, let’s just consider the simpler case illustrated in Figure 3: that of performing a switchover from a running master to a standby in order to, for example, perform maintenance on the original master. In this case, the master is still running, so the situation becomes a lot simpler, since we can control the master and make it work for us instead of against us. We will later consider how to handle the case when the master just goes down because its software crashed, a frustrated cow⁠orker decided to kick the server, or the janitor tripped over the power cord.

Figure 3. Switching over from a running master to a standby

By default, events executed by the slave thread are not logged to the binary log, which turns out to be a problem if the slave is a standby waiting to be used as a master. In this case, it is necessary to have all the changes sent by the master to the standby written to the binary log of the standby—if not, there will be nothing to replicate. To configure the standby server for this, add the option log-slave-updates to the my.cnf file. This option ensures that statements received from the master and executed are also written to the slave’s binary log.

[mysqld]
user            = mysql
pid-file        = /var/run/mysqld/mysqld.pid
socket          = /var/run/mysqld/mysqld.sock
port            = 3306
basedir         = /usr
datadir         = /var/lib/mysql
tmpdir          = /tmp
log-bin         = master-bin
log-bin-index   = master-bin.index
server-id       = 1
log-slave-updates

After updating the options file, restart the server.

The main problem with switching over to a standby in this case is to perform the switchover in such a way that replication starts at the standby precisely where it stopped replicating on the master. If the positions were easy to translate—for example, if the positions were the same on both the master and the standby—we would not have a problem. Unfortunately, the positions may be different on the master and the standby for a number of reasons. The most common case is when the standby was not attached to the master when the master was started, but even if that is done, events cannot be guaranteed to be written the same way to the binary log on the standby as they were written to the binary log on the master.

The basic idea for performing the switchover is to stop the slave and the standby at exactly the same position and then just redirect the slave to the standby. Because the standby hasn’t made any changes after the position where you stopped it, you can just check the binlog position on the standby and direct the slave to start at that position. However, just stopping the slave and the standby will not guarantee that they are synchronized, so you have to do this manually.

To do this, stop both the slave and the standby and compare the binlog positions. Since both positions refer to positions on the same master—the slave and standby are both connected to the same master—you can check the positions just by comparing the filename and the byte position lexicographically (in that order).

standby> SHOW SLAVE STATUS\G
     ...
 Relay_Master_Log_File: master-bin.000096
     ...
   Exec_Master_Log_Pos: 756648
1 row in set (0.00 sec)

slave> SHOW SLAVE STATUS\G
     ...
 Relay_Master_Log_File: master-bin.000096
     ...
   Exec_Master_Log_Pos: 743456
 1 row in set (0.00 sec)

In this case, the standby is ahead of the slave, so just write down the slave position of the standby and start the slave to run until it has caught up with the standby. To have the slave catch up with the standby and stop at the right position:

slave> START SLAVE UNTIL
     ->   MASTER_LOG_FILE = 'master-bin.000096',
     ->   MASTER_LOG_POS =  756648;
Query OK, 0 rows affected (0.18 sec)

slave> SELECT MASTER_POS_WAIT('master-bin.000096',  756648);
Query OK, 0 rows affected (1.12 sec)

The slave and standby have now stopped at exactly the same position, and everything is ready to do the switchover to the standby using CHANGE MASTER TO to direct the slave to the standby and start it. But what position should you specify? Since the file and position that the master recorded for its stopping point are different from the file and position recorded by the standby for the same point, it is necessary to fetch the position that the standby recorded while recording the changes as a master. To do this, execute SHOW MASTER STATUS on the standby:

standby> SHOW MASTER STATUS\G 
*************************** 1. row ***************************
            File: standby-bin.000019 
        Position: 56447 
    Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)

Now you can redirect the slave to the standby using the correct position:

slave> CHANGE MASTER TO
     ->   MASTER_HOST = 'standby-1',
     ->   MASTER_PORT = 3306,
     ->   MASTER_USER = 'repl_user',
     ->   MASTER_PASSWORD = 'xyzzy',
     ->   MASTER_LOG_FILE = '
standby-bin.000019',
     ->   MASTER_LOG_POS = 56447;
  Query OK, 0 rows affected (0.18 sec)

 slave> START SLAVE;
Query OK, 0 rows affected (0.25 sec)

If the opposite is true—that the slave is ahead of the standby—you can just switch the roles of the standby and the slave in the previous steps. This is possible since the master is running and can provide either the slave or the standby with the missing changes. In the next section, we will consider how to handle the situation in which the master has stopped unexpectedly and hence cannot provide either the slave or the standby with the missing changes.

1.1. Handling a switchover in Python

Example 1 shows the Python code for switching a slave over to another master. The replicate_to_position function instructs a server to read from the master only to the given position. When the procedure returns, the slave will have stopped at exactly this position. The switch_to_master directs a slave to a new master. The procedure assumes that both the server on which it executes and the new master are connected to the same original master. If they are not, the positions are not comparable and the procedure will raise an exception.

Example 1. Procedure for switching to a new master

def replicate_to_position(server, pos):
    server.sql("START SLAVE UNTIL MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
               (pos.file, pos.pos))
    server.sql("SELECT MASTER_POS_WAIT(%s,%s)", (pos.file, pos.pos))


def switch_to_master(server, standby):
    stop_slave(server)
    stop_slave(standby)
    server_pos = fetch_slave_position(server)
    standby_pos = fetch_slave_position(standby)
    if server_pos < standby_pos:
        replicate_to_position(server, standby_pos)
    elif server_pos > standby_pos:
        replicate_to_position(standby, server_pos)
    master_pos = fetch_master_position(standby)
    change_master(server, standby, master_pos)
    start_slave(standby)
    start_slave(server)

My SQL : Replication for High Availability - Procedures (part 8) - Circular Replication

My SQL : Replication for High Availability - Procedures (part 7) - Slave Promotion - Slave promotion in Python

My SQL : Replication for High Availability - Procedures (part 6) - Slave Promotion - A revised method for promoting a slave

My SQL : Replication for High Availability - Procedures (part 5) - Semisynchronous Replication - Configuring semisynchronous replication

My SQL : Replication for High Availability - Procedures (part 4) - Dual Masters - Replicated disks using DRBD , Bidirectional replication

My SQL : Replication for High Availability - Procedures (part 3) - Dual Masters - Shared disks

My SQL : Replication for High Availability - Procedures (part 1) - The mysql.com Outage

Other

My SQL : Replication for High Availability - Redundancy, Planning

SQL Server 2012 : Exploring SQL CLR - Deployment (part 2) - Deploying Your Stored Procedures, Testing Your Stored Procedures

SQL Server 2012 : Exploring SQL CLR - Deployment (part 1) - Deploying Your Assembly

SQL Server 2012 : Exploring SQL CLR - Your First SQL CLR Stored Procedure, CLR Stored Procedures and Server-Side Data Access

SQL Server 2012 : Exploring SQL CLR - Visual Studio/SQL Server Integration

SQL Server 2008 : DBCC validation - Removing corruption

SQL Server 2008 : Mirroring in action (part 3) - Suspending and resuming mirroring, Initiating failover

SQL Server 2008 : Mirroring in action (part 2) - Monitoring database mirroring

SQL Server 2008 : Mirroring in action (part 1) - Mirroring setup

SQL Server 2008 : High availability with database mirroring - Failover options