Replication slave start fails with guest timeout

Bug #1462520 reported by Doug Shelley
Affects: OpenStack DBaaS (Trove)
Status: Fix Released
Importance: Medium
Assigned to: Doug Shelley

Bug Description

I was doing a test of multiple replicas starting up against a mysql 5.5 master. I happened to be running under devstack within an OpenStack cloud environment, and my nested nova instances were running a little slower than I would expect. Each time I executed trove create ... --replica_count 2, one of the replicas would go into the FAILED state and then, after a little while, go into ACTIVE.

Did a little digging and noticed that one slave was getting a disconnect from the master upon START SLAVE. In that case, by default, the slave waits 60 seconds before attempting to connect to the master again. However, the guest only waits 60 seconds for the slave to become active before marking it FAILED. Just after the 60-second retry interval, the slave attempted to connect to the master, succeeded, and on the next heartbeat the status was set to ACTIVE. While all of this might sound reasonable, I believe this behaviour is causing the intermittent failures we see in the int-tests.
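(For anyone reproducing this, one way to watch the replica while it retries is to poll SHOW SLAVE STATUS. The sketch below is illustrative only - it assumes pymysql and placeholder connection details, and is not code Trove actually runs.)

    # Illustrative only: poll the replica while it retries the master connection.
    # Host/credentials are placeholders for the replica guest instance.
    import time
    import pymysql

    conn = pymysql.connect(host="replica-host", user="root", password="secret")
    try:
        with conn.cursor(pymysql.cursors.DictCursor) as cur:
            for _ in range(12):                        # watch for ~2 minutes
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone() or {}
                print(row.get("Slave_IO_Running"),     # 'Connecting' while it retries
                      row.get("Connect_Retry"),        # the connect-retry interval
                      row.get("Last_IO_Error"))
                time.sleep(10)
    finally:
        conn.close()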

I don't believe the fix is to increase the timeout; I believe the fix is to shorten the retry value - 60 seconds is way too long for the slave to wait before retrying the master connection. This can only be done in MySQL 5.5 and higher using the MASTER_CONNECT_RETRY parameter on the CHANGE MASTER TO command, which is used in the mysql replication strategies.
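A sketch of the kind of change being suggested, in the spirit of the mysql replication strategy code but with hypothetical helper names and connection handling (the real strategy builds its CHANGE MASTER TO statement differently):

    # Sketch only: pass MASTER_CONNECT_RETRY when pointing a replica at its
    # master. The helper name, constant and dict keys here are hypothetical.
    CONNECT_RETRY_SECONDS = 15  # shorter than the 60 second mysql default

    def attach_replica_to_master(replica_conn, master):
        """Issue CHANGE MASTER TO with a short connect-retry interval."""
        change_master = (
            "CHANGE MASTER TO "
            "MASTER_HOST='{host}', MASTER_PORT={port}, "
            "MASTER_USER='{user}', MASTER_PASSWORD='{password}', "
            "MASTER_CONNECT_RETRY={retry}".format(
                host=master["host"], port=master["port"],
                user=master["user"], password=master["password"],
                retry=CONNECT_RETRY_SECONDS))
        with replica_conn.cursor() as cur:
            cur.execute(change_master)   # register the master with a short retry
            cur.execute("START SLAVE")   # on failure the slave retries every 15s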

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to trove (master)

Fix proposed to branch: master
Review: https://review.openstack.org/188933

Changed in trove:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to trove (master)

Reviewed: https://review.openstack.org/188933
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=052411578e96b7208a0099127a9a7c9af2fdc984
Submitter: Jenkins
Branch: master

commit 052411578e96b7208a0099127a9a7c9af2fdc984
Author: Doug Shelley <email address hidden>
Date: Fri Jun 5 17:36:18 2015 -0400

    Decrease replication slave retry wait time

    If the mysql replication slave fails to connect to the master on
    its first attempt, it will wait 60 seconds to retry (60 seconds is
    the mysql default). This is too long as the guest only waits 60
    seconds to determine if the slave is running. It doesn't make sense
    to increase the timeout, as decreasing the connect retry time will
    make the slave get into the correct state faster. This change
    decreases the retry time to 15 seconds, which should be more than
    enough to withstand 2 or 3 failures and still get connected.

    Change-Id: Ib53daca645ed05f26258ca6b309ab11b0384edd9
    Closes-bug: 1462520
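A quick back-of-the-envelope check of the headroom the 15 second value leaves, using the 60 second guest timeout from the description (numbers only, not Trove code):

    # Rough arithmetic only: how many connect attempts fit inside the guest's
    # 60 second wait with the old and new retry intervals.
    GUEST_TIMEOUT = 60      # seconds the guest waits for the slave to be running
    NEW_RETRY = 15          # retry interval set by this change
    OLD_RETRY = 60          # mysql default

    # Attempts happen at t=0, t=retry, t=2*retry, ... within the window.
    print(GUEST_TIMEOUT // NEW_RETRY)   # 4 attempts, so up to 3 failures can be absorbed
    print(GUEST_TIMEOUT // OLD_RETRY)   # 1 attempt, so any failure blows the window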

Changed in trove:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in trove:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in trove:
milestone: liberty-1 → 4.0.0