nodes joining rabbitmq cluster sometimes hang

Bug #1573030 reported by Darren Birkett
This bug affects 1 person
Affects            Status         Importance  Assigned to     Milestone
OpenStack-Ansible  Fix Released   Undecided   Darren Birkett
Liberty            Fix Committed  Undecided   Darren Birkett
Mitaka             Fix Committed  Undecided   Darren Birkett
Trunk              Fix Released   Undecided   Darren Birkett

Bug Description

Sometimes when secondary/tertiary nodes join the primary node and try to form a cluster, the join hangs indefinitely and ends up being killed by the gate job timeout.

Since this is not easily reproducible, it may take some time to track down. The OSA gate logs look like this:

2016-04-20 18:39:37.738 | TASK: [{{ rolename | basename }} | Join rabbitmq cluster] *********************
2016-04-20 18:39:37.782 | skipping: [container1]
2016-04-20 18:39:39.160 | changed: [container3]
2016-04-20 19:32:52.965 | Build timed out (after 60 minutes). Marking the build as failed.
2016-04-20 19:32:53.020 | Build was aborted
2016-04-20 19:32:53.020 | [SCP] Copying console log.
2016-04-20 19:32:53.496 | [SCP] Trying to create /srv/static/logs/52/308352/2/gate/gate-openstack-ansible-rabbitmq_server-ansible-func-ubuntu-trusty
2016-04-20 19:32:53.544 | [SCP] Trying to create /srv/static/logs/52/308352/2/gate/gate-openstack-ansible-rabbitmq_server-ansible-func-ubuntu-trusty/4483eb0
2016-04-20 19:32:53.590 | Finished: FAILURE

Darren Birkett (darren-birkett) wrote :

My current working theory is that when the joining nodes attempt to join the first node and form a cluster, they all do it at exactly the same time. This possibly causes a race where they attempt to sync from each other, but because they have not synced from the master yet they have Mnesia database inconsistencies and the join fails (it shouldn't hang, but you know, rabbitmq).

Probably need a separate play for the primary node, then a separate play for joiners set to use serial so they join one at a time. Or something else that makes them join one at a time (you can't just serialise one task apparently).

Will update more if I confirm this theory.
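
A minimal sketch of the two-play layout suggested above. It assumes an inventory group called `rabbitmq_all` whose first host acts as the primary, and a role called `rabbitmq_server`; both names are illustrative assumptions and are not taken from the actual review.

```yaml
# Hypothetical sketch: the group name and role name are assumptions.

# Play 1: bring up the primary node on its own.
- name: Install rabbitmq on the primary node
  hosts: "rabbitmq_all[0]"
  roles:
    - role: rabbitmq_server

# Play 2: the remaining nodes join one at a time, so no joiner can
# try to sync from another node that is itself still joining.
- name: Install rabbitmq on the joining nodes
  hosts: "rabbitmq_all[1:]"
  serial: 1
  roles:
    - role: rabbitmq_server
```

The `serial` keyword applies to a whole play rather than to a single task, which is why the joiners get a play of their own here.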

Changed in openstack-ansible:
assignee: nobody → Darren Birkett (darren-birkett)
Darren Birkett (darren-birkett) wrote :

So I managed to prove the theory: joining nodes, when left to all join at the same time (as in the current implementation), often hit a situation where they try to sync from each other when neither is actually synced yet. This causes an Mnesia database issue, and the join hangs. RabbitMQ doesn't seem to behave like a Galera cluster, where a 'donor' node has to be fully synced before it can be a donor. It seems that a joining node can pick another joining node to sync from even though that node is not itself synced yet.

To prove this I added serial=1 to the playbook that calls the role, and moved the testing out into a separate play (so that it wasn't getting called at the end of each serial run of the playbook/role, which would fail). Running a loop of destroying containers and then running the tests over and over again on 10 separate nodes for 4 days straight, I have not seen a single instance of a cluster join hang.

I'll post a review up soon and the implementation can be debated (it's a bit dirty in the test plays at least, because I needed to hardcode the path to the role default vars).
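
Roughly, the test setup described above could look like the sketch below. The group name `rabbitmq_all`, the role name `rabbitmq_server`, and the single check task are assumptions for illustration; the real test plays are in the linked review.

```yaml
# Hypothetical sketch: names and the test task are assumptions.

# Install play: one host per batch, so nodes join the cluster in turn.
- name: Deploy rabbitmq, one node at a time
  hosts: rabbitmq_all
  serial: 1
  roles:
    - role: rabbitmq_server

# Test play: kept separate so it runs once, after every batch of the
# serial play above has finished and the whole cluster is formed.
- name: Test the fully formed cluster
  hosts: rabbitmq_all
  tasks:
    - name: Check cluster status
      command: rabbitmqctl cluster_status
      changed_when: false
```

If the checks stayed in the install play as post_tasks, they would run at the end of each serial batch, when only part of the cluster exists.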

OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible-rabbitmq_server (master)

Fix proposed to branch: master
Review: https://review.openstack.org/310542

Changed in openstack-ansible:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (master)

Fix proposed to branch: master
Review: https://review.openstack.org/313168

OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (master)

Reviewed: https://review.openstack.org/313168
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=21c24789019bf1ee85347356fe6c70b0b9045d7c
Submitter: Jenkins
Branch: master

commit 21c24789019bf1ee85347356fe6c70b0b9045d7c
Author: Darren Birkett <email address hidden>
Date: Thu May 5 22:17:11 2016 +0100

    install rabbitmq-server in serial

    In order to prevent race conditions with nodes joining the cluster
    simultaneously when the cluster is first formed, we move the rabbitmq
    installation play to be 'serial: 1'. However, when the nodes are being
    upgraded, it cannot be done in serial so in this case we set 'serial: 0'

    There are some tasks/roles called in this playbook that can still be
    run in parallel, so we split out the rabbitmq-server install into a
    separate play so that we only serialise the parts that are necessary
    to ensure maximum efficiency.

    Change-Id: I97cdae27fdce4f400492c2134b6589b55fbc5a61
    Fixes-Bug: #1573030
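
One way the install/upgrade switch described in this commit message might be expressed is sketched below. The variable name `rabbitmq_upgrade`, the group `rabbitmq_all`, and the role name are assumptions and may not match the merged change. The sketch also assumes the variable is passed as an extra var (`-e`), so that it is available when the play-level `serial` keyword is evaluated.

```yaml
# Hypothetical sketch: variable, group and role names are assumptions.

# Tasks that are safe to run everywhere at once stay in a parallel play.
- name: Prepare rabbitmq hosts
  hosts: rabbitmq_all
  tasks:
    - name: Placeholder for preparation steps that can run in parallel
      debug:
        msg: "host preparation"

# Only the server install is serialised. serial: 0 disables batching,
# so an upgrade run still touches all nodes together.
- name: Install rabbitmq server
  hosts: rabbitmq_all
  serial: "{{ (rabbitmq_upgrade | default(false) | bool) | ternary(0, 1) }}"
  roles:
    - role: rabbitmq_server
```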

Changed in openstack-ansible:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (liberty)

Fix proposed to branch: liberty
Review: https://review.openstack.org/314457

OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/314458

OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible-rabbitmq_server (master)

Reviewed: https://review.openstack.org/310542
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-rabbitmq_server/commit/?id=5dc67955f0ac08a7c9719d641e9828558620da89
Submitter: Jenkins
Branch: master

commit 5dc67955f0ac08a7c9719d641e9828558620da89
Author: Darren Birkett <email address hidden>
Date: Thu May 5 22:36:34 2016 +0100

    install rabbitmq-server in serial

    In order to prevent race conditions with nodes joining the cluster
    simultaneously when the cluster is first formed, we move the rabbitmq
    installation play to be 'serial: 1'. However, when the nodes are being
    upgraded, it cannot be done in serial so in this case we set 'serial: 0'

    The tests are removed from a post_task include in the install play, and
    moved to their own play as they need to be run after the entire cluster
    has been formed. As well as moving a few generic vars into the
    test-vars.yml include, we also pass in the specific version of rabbitmq
    to be tested against in the test play.

    Fixes-Bug: #1573030

    Change-Id: Id119ff9f20ddfd8e1f29598c8c5ce862d2e7fab4

OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (stable/mitaka)

Reviewed: https://review.openstack.org/314458
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=0861feaf36bf2f1f6c52a8587cbf2b2dcbade65e
Submitter: Jenkins
Branch: stable/mitaka

commit 0861feaf36bf2f1f6c52a8587cbf2b2dcbade65e
Author: Darren Birkett <email address hidden>
Date: Thu May 5 22:17:11 2016 +0100

    install rabbitmq-server in serial

    In order to prevent race conditions with nodes joining the cluster
    simultaneously when the cluster is first formed, we move the rabbitmq
    installation play to be 'serial: 1'. However, when the nodes are being
    upgraded, it cannot be done in serial so in this case we set 'serial: 0'

    There are some tasks/roles called in this playbook that can still be
    run in parallel, so we split out the rabbitmq-server install into a
    separate play so that we only serialise the parts that are necessary
    to ensure maximum efficiency.

    Change-Id: I97cdae27fdce4f400492c2134b6589b55fbc5a61
    Fixes-Bug: #1573030
    (cherry picked from commit 21c24789019bf1ee85347356fe6c70b0b9045d7c)

OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible (liberty)

Reviewed: https://review.openstack.org/314457
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible/commit/?id=068e948b6143b33b988d91117771c9e834ec7766
Submitter: Jenkins
Branch: liberty

commit 068e948b6143b33b988d91117771c9e834ec7766
Author: Darren Birkett <email address hidden>
Date: Tue May 10 09:11:11 2016 +0100

    install rabbitmq-server in serial

    In order to prevent race conditions with nodes joining the cluster
    simultaneously when the cluster is first formed, we move the rabbitmq
    installation play to be 'serial: 1'. However, when the nodes are being
    upgraded, it cannot be done in serial so in this case we set 'serial: 0'

    There are some tasks/roles called in this playbook that can still be
    run in parallel, so we split out the rabbitmq-server install into a
    separate play so that we only serialise the parts that are necessary
    to ensure maximum efficiency.

    Based on commit: 21c24789019bf1ee85347356fe6c70b0b9045d7c

    Change-Id: I97cdae27fdce4f400492c2134b6589b55fbc5a61
    Fixes-Bug: #1573030

OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-ansible-rabbitmq_server (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/322014

OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-ansible-rabbitmq_server (stable/mitaka)

Reviewed: https://review.openstack.org/322014
Committed: https://git.openstack.org/cgit/openstack/openstack-ansible-rabbitmq_server/commit/?id=f44e61a291bca62cf3154bb5f3d4e4f619f80745
Submitter: Jenkins
Branch: stable/mitaka

commit f44e61a291bca62cf3154bb5f3d4e4f619f80745
Author: Darren Birkett <email address hidden>
Date: Thu May 5 22:36:34 2016 +0100

    install rabbitmq-server in serial

    In order to prevent race conditions with nodes joining the cluster
    simultaneously when the cluster is first formed, we move the rabbitmq
    installation play to be 'serial: 1'. However, when the nodes are being
    upgraded, it cannot be done in serial so in this case we set 'serial: 0'

    The tests are removed from a post_task include in the install play, and
    moved to their own play as they need to be run after the entire cluster
    has been formed. As well as moving a few generic vars into the
    test-vars.yml include, we also pass in the specific version of rabbitmq
    to be tested against in the test play.

    Fixes-Bug: #1573030

    Change-Id: Id119ff9f20ddfd8e1f29598c8c5ce862d2e7fab4
    (cherry picked from commit 5dc67955f0ac08a7c9719d641e9828558620da89)
