Home > Databases, MySQL, Technical > MySQL Replication – Silent Timeout Failure

MySQL Replication – Silent Timeout Failure

July 19th, 2009

I have been encountering a strange error that causes MySQL replication to stop working without any errors reported via SHOW SLAVE STATUS. The command reports everything running ok and that the slave is caught up with the master (seconds behind master=0).

However, looking into the running processes via SHOW FULL PROCESSLIST I noticed that the thread reading the binary log data from the master has disappeared mysteriously.

Normally you should see two threads related to replication running on a slave server under the system user. One for reading the binary log data from the master and the other one for executing fetched data from the relay log:

mysql>SHOW FULL PROCESSLIST;
+--------+-------------+---------------------+------+---------+---------+-----------------------------------------------------------------------+-----------------------+
| Id     | User        | Host                | db   | Command | Time    | State                                                                 | Info                  |
+--------+-------------+---------------------+------+---------+---------+-----------------------------------------------------------------------+-----------------------+
|  37550 | system user |                     | NULL | Connect | 5878076 | Waiting for master to send event                                      | NULL                  |
|  37551 | system user |                     | NULL | Connect |       0 | Has read all relay log; waiting for the slave I/O thread to update it | NULL                  |
| 119738 | you         | 127.0.0.1:37540 | NULL | Query   |       0 | NULL                                                                  | SHOW FULL PROCESSLIST |
+--------+-------------+---------------------+------+---------+---------+-----------------------------------------------------------------------+-----------------------+
3 rows in set (0.17 sec)

mysql>

Restarting the slave would work for a short while until the same silent failure occurred.

In order to counter this, I looked into the variables related to networking and replication. modified the global variable ’slave_net_timeout’. The default variable for this is 3600 seconds (ie. 1 hour). The slave timeout sets the number of seconds to wait for the master to send more data before aborting the read. I set it to a short, 10 second period:

SET GLOBAL slave_net_timeout=10;

This improved the situation and the incidence of replication mysteriously stopping dropped dramatically. Additionally now the slave IO thread would actually stop running which, at least, given an indication that something is wrong.

Overall, the number of these failures dropped to almost zero just as suddenly as they had appeared. This leads me to believe that there might have been something in the network connection between the master and slave servers (in different datacenters) that was causing this.

Databases, MySQL, Technical , ,

  1. No comments yet.
  1. No trackbacks yet.