Tuesday, April 8, 2014

MySQL Developers Conference: MySQL Replication and Scalability

The slide deck contains the latest developments in MySQL Replication. It covers:
- An introduction to MySQL Replication
- Scaling with Multi-threaded slaves
- Data aggregation with Multi-source replication
- Lossless failover with semi-synchronous replication
- Replication Monitoring made easier

Tuesday, April 1, 2014

MySQL-5.7.4- Change master without stopping slave altogether

At MySQL, we have been working on simplifying the failover process making it faster, more flexible and easier to use. MySQL-5.6 came up with Global Transaction Identifiers (GTID) which was a huge leap in the direction of easing the failover process hiding the details about replication logs and positions. With MySQL-5.7.4, we are introducing a new feature that further adds to flexibility and onliness- the user can only shut down components that he needs to re-configure.

What we allow with this new feature is to execute CHANGE MASTER TO command without stopping slave altogether. We realized that stopping slave altogether is not mandatory in all cases and doing that was more of a cautious approach to switching master restricting more than what’s required at times.

Lets dive deep into this to see what can be relaxed here. For this, lets break the replication process into two modules:
M1) Receiver module (concerned with IO thread) and
M2) Applier module (concerning SQL thread or coordinator and worker threads, whichever be the case)

We can now divide options under the command CHANGE MASTER TO into three groups based on the above classification: 

G1) Options that change a receiver configuration. 
G2) Options that change an applier configuration. 
G3) Options that relate to both (1) and (2). 

For the precise division look at the picture below. Note that the illustration takes into account all the CHANGER MASTER TO options present currently (MySQL-5.7.4). 




Note that given its current usage, we could put the MASTER_AUTO_POSITION  option under group G1(i.e., receive side). Currently only the receiver module uses GTID positioning but we foresee that in future it will be good to allow the applier module to use GTID positioning. We thus keep the master_auto_position option under group G3 to keep things future-proof. Worried that obstructs the failover process again like before? Well that's not a problem as MASTER_AUTO_POSITION is a slave's configuration that you only set once. Then it affects all future times that you redirect to a new immediate master. So you don't specify it on fail-over.

With the classifications stated above, we propose a 3-point rule stated as:
R1) For CHANGE MASTER TO options under group G1, stop only receiver module (M1) using the command STOP SLAVE IO_THREAD command.
R2) For CHANGE MASTER TO options under group G2, stop only applier module (M2) using the command STOP SLAVE SQL_THREAD command.
R3) For CHANGE MASTER TO options under group G3, stop both receiver (M1) and
applier modules (M2) using the command STOP SLAVE.

HOW DOES THIS RULE WORK?
Lets explore more about our 3-point rule(R1-R3):

  • Starting with rule R1, we stated that we only need to stop the receiver thread to change receive options. What happens to the applier module? Well the applier module keeps applying pending transactions, if any. If you have a situation where the slave was lagging behind with a lot of transactions queued into the slave's logs, you can allow the applier module to catch up while you switch masters or change a configuration on the receive side keeping the master same.
  • Under rule R2, we stated that we can change configuration of applier module after stopping the applier threads ONLY, receiver module can be running while you do this. So while you fine-tune your slave applier module the receiver module keeps reading master's log and copying transactions to the slave's log. These could then be applied in-parallel by the slave when the applier module is up and running.
  • Under rule R3, you stop both receiver and applier modules. So, this is analogous to
    STOP SLAVE;
    CHANGE MASTER TO <master_def>;
    used before this feature was available.

Worried how relay log purge would be handled now? Well its pretty simple- Under rules R1 and R2, we do not purge logs implicitly on executing the CHANGE MASTER command so that the receiver or applier whichever is running just keeps processing/adding relay logs as it would do if no replication thread was stopped.

Finally note that you need not always look at the figure above to find which options are allowed which thread being stopped. You just need to ask yourself if the parameter is related to receiver thread and stop the concerned thread. And if you go wrong there are error messages to guide you on the right path. Look at the next section for the usage and the errors.

EXAMPLES OF USAGE
example 1:
Previously, to change master heartbeat period, you would do a

STOP SLAVE;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD= <value>;
START SLAVE;

Now, with this feature you just have to stop the receiver (io) thread as heartbeat has nothing to do with the applier thread(s).
 
STOP SLAVE IO_THREAD;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD= <value>;
START SLAVE IO_THREAD;

Note that the applier thread keeps executing the transactions in the relay log while you change the heartbeat period for the master-slave connection. Likewise, you could do this with all the attributes mentioned in group G1 in the figure above.

example 2:
Similarly, to change applier thread attributes, you just have to stop the applier threads.
So instead of

STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY=<value>;
START SLAVE;

it is enough to do the following with this feature.

STOP SLAVE SQL_THREAD;
CHANGE MASTER TO MASTER_DELAY=<value>;
START SLAVE SQL_THREAD;

Lastly, if you go wrong there are nicely worded error message to guide you. So in the first case, if your receiver module is active and you execute a

CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD= <value>;

 you get an error saying:
This operation cannot be performed with a running slave io thread; run STOP SLAVE IO_THREAD first.


and if you forgot changing applier module when it was required, the server will say:

This operation cannot be performed with a running slave sql thread; run STOP SLAVE SQL_THREAD first.


Lastly you still have the error message saying both the threads should stop appearing only for MASTER_AUTO_POSITION option now:
MASTER
This operation cannot be performed with a running slave; run STOP SLAVE first.

Lets see some examples once again:

example 3:
slave>START SLAVE;
slave>CHANGE MASTER TO MASTER_DELAY= 10;
ERROR 1900 (HY000): This operation cannot be performed with a running slave sql thread; run STOP SLAVE SQL_THREAD first

example 4:
mysql> CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD= 10;
ERROR 1904 (HY000): This operation cannot be performed with a running slave io thread; run STOP SLAVE IO_THREAD first.

example 5:
mysql> CHANGE MASTER TO MASTER_AUTO_POSITION= 0;
ERROR 1198 (HY000): This operation cannot be performed with a running slave; run STOP SLAVE first

SIDE-EFFECTS?

While implementing this, we have taken special care to make sure we dont break anything for a user switching masters like:

STOP SLAVE;
CHANGE MASTER to <master_def>;
START SLAVE.

There are absolutely NO side-effects to worry you. Note that as stated before, CHANGE MASTER TO will not delete relay logs if one of the receiver or applier thread is running.

Try it out and give us your feedback. As always, we look forward to hearing from you to improve this feature. Enjoy :)