Replication slave start fails with guest timeout
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack DBaaS (Trove) | Fix Released | Medium | Doug Shelley | |
Bug Description
I was testing multiple replicas starting up against a MySQL 5.5 master. I happened to be running under devstack within an OpenStack cloud environment, and my nested nova instances were running somewhat slower than I would expect. Each time I executed trove create ... --replica_count 2, one of the replicas would go into the FAILED state and then, after a little while, go into ACTIVE.
A little digging showed that one slave was getting disconnected from the master upon START SLAVE. In that case, by default, the slave waits 60 seconds before attempting to reconnect to the master. However, the guest only waits 60 seconds for the slave to become active before marking it FAILED. Just after the 60-second retry interval elapsed, the slave reconnected to the master successfully, and on the next heartbeat the status was set to ACTIVE. While all of this might sound reasonable, I believe this behaviour is causing intermittent failures in the int-tests.
I don't believe the fix is to increase the guest timeout; I believe the fix is to shorten the retry value - 60 seconds is far too long for the slave to wait before retrying the master connection. This can only be done in MySQL 5.5 and higher using the MASTER_
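The race described above can be sketched with a scaled-down poll loop (a minimal illustration with hypothetical names such as `poll_until` and `make_slave`; Trove's actual guest agent uses its own polling utilities):

```python
import time

def poll_until(predicate, timeout, interval=1.0):
    """Poll predicate until it returns True or timeout elapses.

    Returns True if the predicate succeeded within the timeout,
    False otherwise (the guest would then mark the replica FAILED).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

def make_slave(retry):
    """Simulate a slave whose IO thread only reconnects after `retry` seconds."""
    start = time.monotonic()
    return lambda: time.monotonic() - start >= retry

# Scaled-down numbers: in the bug, both the reconnect retry and the
# guest timeout are 60 s, so the reconnect lands just after the guest
# has already given up.
# Reconnect interval (0.3 s) exceeds guest timeout (0.2 s) -> FAILED:
print(poll_until(make_slave(0.3), timeout=0.2, interval=0.05))   # False
# Shorter reconnect interval (0.05 s) fits within the timeout -> ACTIVE:
print(poll_until(make_slave(0.05), timeout=0.2, interval=0.05))  # True
```

This is why shortening the slave's retry interval (rather than lengthening the guest timeout) resolves the race: the reconnect then lands well inside the guest's polling window.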
Changed in trove:
status: Fix Committed → Fix Released

Changed in trove:
milestone: liberty-1 → 4.0.0
Fix proposed to branch: master
Review: https://review.openstack.org/188933