5. Circular Replication
After reading about dual masters, you might wonder whether it is possible to
set up a multimaster topology with more than two masters replicating to each
other. Since each slave can have only a single master, the only way to
achieve this is to set up replication in a circular
fashion.
Although this is not a recommended setup, it is certainly
possible. The reason it is not recommended is that it is very hard to
get it to work correctly in the presence of failures, as will become
clear in a moment.
Using a circular replication setup with three or more servers can
be quite practical for reasons of locality. As a real-life example,
consider the case of a mobile phone operator with subscribers all over
Europe. Since mobile phone owners travel around quite a lot, it is
convenient to keep each subscriber's registry data close to the actual
phone. By placing the data centers at strategic places in
Europe, it is possible to verify call data quickly and also register new
calls locally. The changes can then be replicated to all the servers in
the ring, and eventually all servers will have accurate billing
information. In this case, circular replication is a perfect setup: all
subscriber data is replicated to all sites, and updates of data are
allowed in all data centers.
Setting up circular replication (as shown in Figure 10) is quite easy. Example 16 provides a script that sets
up circular replication automatically, so where are the complications?
As in every setup, you should ask yourself, “What happens when something
goes wrong?”
Example 16. Setting up circular replication
def circular_replication(server_list):
    # Chain the servers: connect each server to the previous one in the
    # list, wrapping around at the end to close the circle.
    count = len(server_list)
    for i in range(0, count):
        change_master(server_list[(i+1) % count], server_list[i])
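As a hypothetical example, assuming server objects named after the four cities in Figure 10 and the change_master helper used in the earlier examples, the ring could be created with a single call:

circular_replication([stockholm, moscow, paris, london])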
In Figure 10, there are four
servers named for the cities in which they are located (the names are
arbitrarily picked and do not reflect a real setup). Replication goes in
a circle: “Stockholm” to “Moscow” to “Paris” to “London” and back to
“Stockholm.” This means that “Moscow” is upstream of “Paris,” but
downstream of “Stockholm.” Suppose that “Moscow” goes down suddenly and
unexpectedly. To allow replication to continue, it is necessary to
reconnect the “downstream” server “Paris” to the “upstream” server
“Stockholm” to ensure the continuing operation of the system.
Figure 11 shows
a scenario in which a single server fails and the servers reconnect to
allow replication to continue. Sounds simple enough, doesn’t it? Well,
it’s not really as simple as it looks. There are basically three issues
that you have to consider:
The downstream server—the server that was slave to the failed
master—needs to connect to the upstream server and start replication
from what it last saw. How is that position decided?
Suppose that the crashed server has managed to send out some
events before crashing. What happens with those events?
We need to consider how we should bring the failed server into
the topology again. What if the server applied some transactions of
its own that were written to the binary log but not yet sent out? It
is clear that these transactions are lost, so we need to handle
this.
When you detect that one of the servers has failed, it is easy to use
the CHANGE MASTER command
to connect the downstream server to the upstream server, but for
replication to work correctly, we must determine the right position. To
find the correct position, use binary log scanning techniques similar to
what we used for slave promotion. However, in this case, we have several
servers to consider when deciding what position to start from. The
Last_Exec_Trans table introduced earlier already
contains, for each server, the last global transaction ID seen from that server.
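To illustrate, here is a minimal sketch of that position-finding step. It is not the exact code used earlier in the chapter: it assumes plain DB-API connections to the servers, and an iterable of transactions produced newest first by a binary log scanner like the one used for slave promotion, with field names that are assumptions of this sketch.

from collections import namedtuple

# One entry per transaction found in the upstream server's binary log,
# listed newest first. The scanner from the slave promotion section is
# assumed to produce something equivalent; log_pos is taken to be the
# position just after the transaction, so replication can resume there.
BinlogTrans = namedtuple("BinlogTrans",
                         ["server_id", "trans_id", "log_file", "log_pos"])

def last_seen_transactions(downstream, failed_server_id):
    """Read Last_Exec_Trans on the downstream server, skipping the
    entry for the failed server itself."""
    cursor = downstream.cursor()
    cursor.execute("SELECT Server_ID, Trans_ID FROM Last_Exec_Trans"
                   " WHERE Server_ID != %s", (failed_server_id,))
    return dict(cursor.fetchall())

def find_reconnect_pos(transactions, seen):
    """Walk the upstream server's transactions from the newest backward
    and stop at the first one the downstream server has already
    executed; that is where the downstream server should continue."""
    for trans in transactions:
        last = seen.get(trans.server_id)
        if last is not None and trans.trans_id <= last:
            return trans.log_file, trans.log_pos
    raise LookupError("no common transaction found in the binary log")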
The second problem is more complicated. If the failing server
managed to send out an event, there is nothing that can remove that
event from the replication stream, so it will circle around the
replication topology forever. If the statement is idempotent—it can be
reapplied multiple times without causing problems—the situation could be
manageable for a short period, but in general, the statement has to be
removed somehow.
In MySQL version 5.5, the parameter IGNORE_SERVER_IDS was added to the CHANGE MASTER
command. Normally a server removes only events that carry its own server
ID from the replication stream; this parameter allows it to remove events
carrying other server IDs as well. So, assuming that the servers have the IDs shown in Figure 11, we can reconnect
Paris to Stockholm using the following command:
paris> CHANGE MASTER TO
-> MASTER_HOST='stockholm.example.com',
-> IGNORE_SERVER_IDS = (2);
For versions of MySQL earlier than version 5.5, there is no such
support and you may have to devise some other means of removing the
offending events. The easiest method is probably to bring in a server
temporarily with the same ID as the crashed server for the sole purpose
of removing the offending event.
The complete procedure to shrink the ring in a circular
setup—assuming that you are using MySQL 5.5—is as follows (a scripted
sketch appears after the procedure):
Determine the global transaction IDs of the last committed
transactions on the downstream server for all servers that are still
up and running.
paris> SELECT Server_ID, Trans_ID FROM Last_Exec_Trans WHERE Server_ID != 2;
+-----------+----------+
| Server_ID | Trans_ID |
+-----------+----------+
| 1 | 5768 |
| 3 | 4563 |
| 4 | 768 |
+-----------+----------+
3 rows in set (0.00 sec)
Scan the binary log of the upstream server for the last of the
global transaction IDs seen in
Last_Exec_Trans.
Connect the downstream server to this position using CHANGE
MASTER.
paris> CHANGE MASTER TO
-> MASTER_HOST='stockholm.example.com',
-> IGNORE_SERVER_IDS = (2);
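Put together, the shrink step might be scripted roughly as follows. This is only a sketch under the same assumptions as the previous one: DB-API connections with client-side parameter interpolation, the two helper functions defined above, and an illustrative host name argument.

def shrink_ring(downstream, upstream_host, failed_id, transactions):
    """Remove the failed server from the ring: figure out where the
    downstream server should continue and point it at the upstream
    server, dropping any events that originate from the failed server."""
    seen = last_seen_transactions(downstream, failed_id)         # step 1
    log_file, log_pos = find_reconnect_pos(transactions, seen)   # step 2
    cursor = downstream.cursor()                                 # step 3
    # CHANGE MASTER requires the slave threads to be stopped first.
    cursor.execute("STOP SLAVE")
    # IGNORE_SERVER_IDS takes a parenthesized list, so the failed
    # server's ID is formatted in directly; the other values go through
    # normal parameter interpolation (the log position is an integer).
    cursor.execute(
        "CHANGE MASTER TO"
        "  MASTER_HOST = %s,"
        "  MASTER_LOG_FILE = %s,"
        "  MASTER_LOG_POS = %s,"
        "  IGNORE_SERVER_IDS = ({0})".format(int(failed_id)),
        (upstream_host, log_file, log_pos))
    cursor.execute("START SLAVE")

For the scenario in Figure 11, the call would be something like shrink_ring(paris, 'stockholm.example.com', 2, stockholm_transactions), where the last argument is the scanned binary log of "Stockholm."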
Since the failed server can be in an alternative future compared
to the other servers, the safest way to bring it back into the circle
is to restore it from one of the servers in the ring and then splice it
back in so that the ring is complete again. The steps to accomplish
that are as follows (a scripted sketch appears after the procedure):
Restore the server from one of the existing servers—the server
that will eventually be the upstream server—in the ring and attach
it as a slave to that server.
moscow> CHANGE MASTER TO MASTER_HOST='stockholm.example.com';
Query OK, 0 rows affected (0.18 sec)
moscow> START SLAVE;
Query OK, 0 rows affected (0.00 sec)
Once the server has caught up sufficiently, break the ring by
disconnecting the downstream server. This server will no longer
receive any updates.
paris> STOP SLAVE;
Query OK, 0 rows affected (0.00 sec)
Since the restored server might not yet have all the events that
the downstream server has, wait for the restored server to catch up
to at least the downstream server's position.
Because both positions refer to the binary log of the same master,
you can do this using a combination of SHOW SLAVE STATUS
and MASTER_POS_WAIT.
paris> SHOW SLAVE STATUS;
...
Relay_Master_Log_File: stockholm-bin.000096
...
Exec_Master_Log_Pos: 756648
1 row in set (0.00 sec)
moscow> SELECT MASTER_POS_WAIT('stockholm-bin.000096', 756648);
+-------------------------------------------------+
| MASTER_POS_WAIT('stockholm-bin.000096', 756648) |
+-------------------------------------------------+
| 985761 |
+-------------------------------------------------+
1 row in set (156.32 sec)
Determine the corresponding position on the restored server by
scanning its binary log for the global transaction ID
that was last seen by the downstream server.
Connect the downstream server to the restored server and start
replication.
paris> CHANGE MASTER TO
-> MASTER_HOST='moscow.example.com',
-> MASTER_LOG_FILE='moscow-bin.000107',
-> MASTER_LOG_POS=196758;
Query OK, 0 rows affected (0.18 sec)
paris> START SLAVE;
Query OK, 0 rows affected (0.00 sec)
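Under the same assumptions as the earlier sketches (DB-API connections, the find_reconnect_pos helper, and a newest-first transactions iterable for the restored server's binary log; host names are again only illustrative), the re-insertion steps might be scripted roughly like this. Step 1, restoring the server and attaching it to its upstream neighbor, is assumed to have been done already.

def rejoin_ring(restored, downstream, restored_host, transactions):
    """Splice a restored server back in between its upstream neighbor
    (which it is already replicating from, per step 1) and the
    downstream server, without losing any events along the way."""
    # Step 2: stop the downstream server so that its position is stable.
    down = downstream.cursor()
    down.execute("STOP SLAVE")

    # Step 3: read the downstream server's position relative to the
    # common master and wait until the restored server has executed at
    # least that far.
    down.execute("SHOW SLAVE STATUS")
    status = dict(zip([col[0] for col in down.description],
                      down.fetchone()))
    rest = restored.cursor()
    rest.execute("SELECT MASTER_POS_WAIT(%s, %s)",
                 (status["Relay_Master_Log_File"],
                  status["Exec_Master_Log_Pos"]))
    rest.fetchone()

    # Step 4: find where the downstream server should continue in the
    # restored server's binary log, using the same scanning technique
    # as before.
    down.execute("SELECT Server_ID, Trans_ID FROM Last_Exec_Trans")
    seen = dict(down.fetchall())
    log_file, log_pos = find_reconnect_pos(transactions, seen)

    # Step 5: point the downstream server at the restored server and
    # restart replication.
    down.execute("CHANGE MASTER TO MASTER_HOST = %s,"
                 " MASTER_LOG_FILE = %s, MASTER_LOG_POS = %s",
                 (restored_host, log_file, log_pos))
    down.execute("START SLAVE")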