3 controller environment rabbitmq_init_bundle container only exits successfully on controller-0 during FFU and upgrades

Bug #1753949 reported by Michele Baldessari
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Michele Baldessari
Milestone: (none listed)

Bug Description

Seen via https://bugzilla.redhat.com/show_bug.cgi?id=1547589 and https://bugzilla.redhat.com/show_bug.cgi?id=1551265

OSP11 -> OSP12 upgrade: the major upgrade composable step fails because the rabbitmq_init_bundle container fails to start on 2 of the 3 controllers. We can see in the rabbitmq_init_bundle container that the '/usr/sbin/rabbitmqctl -q list_users' command fails on controller-1 and controller-2. The command succeeds in the container on controller-0.

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Michele Baldessari (michele)
Changed in tripleo:
status: Triaged → In Progress
Michele Baldessari (michele) wrote :

And there seems to be a 3rd bug https://bugzilla.redhat.com/show_bug.cgi?id=1551397 that might be related to this. Damien and I looked at logs of https://bugzilla.redhat.com/show_bug.cgi?id=1551265 and what puzzled us is that rabbitmqctl list_users timed out even though rabbitmq was clearly up on the node:
ctrl-1:
Mar 03 05:34:48 controller-1 dockerd-current[364070]: Debug: Executing: '/usr/sbin/rabbitmqctl -q list_users'
...
Mar 03 05:38:08 controller-1 dockerd-current[364070]: Debug: Executing: '/usr/sbin/rabbitmqctl -q list_users'
Shortly after, puppet fails due to a timeout.

But rabbitmq was up after "05:36:40":
Mar 03 05:35:17 controller-1 docker(rabbitmq-bundle-docker-1)[386276]: INFO: checking for nsenter, which is required when 'monitor_cmd' is specified
Mar 03 05:35:17 controller-1 docker(rabbitmq-bundle-docker-1)[386425]: INFO: monitor cmd passed: exit code = 0
Mar 03 05:35:17 controller-1 docker(rabbitmq-bundle-docker-1)[386485]: INFO: monitor cmd passed: exit code = 0
Mar 03 05:35:39 controller-1 docker(rabbitmq-bundle-docker-1)[387381]: INFO: checking for nsenter, which is required when 'monitor_cmd' is specified
Mar 03 05:35:40 controller-1 docker(rabbitmq-bundle-docker-1)[387542]: INFO: monitor cmd passed: exit code = 0
Mar 03 05:35:40 controller-1 docker(rabbitmq-bundle-docker-1)[387605]: INFO: monitor cmd passed: exit code = 0
Mar 03 05:35:45 controller-1 rabbitmq-cluster(rabbitmq)[388021]: DEBUG: rabbitmq monitor : 7
Mar 03 05:36:40 controller-1 docker(rabbitmq-bundle-docker-1)[392477]: INFO: monitor cmd passed: exit code = 0
Mar 03 05:36:48 controller-1 rabbitmq-cluster(rabbitmq)[393245]: DEBUG: rabbitmq monitor : 0
Mar 03 05:37:02 controller-1 rabbitmq-cluster(rabbitmq)[394468]: DEBUG: rabbitmq monitor : 0
Mar 03 05:37:16 controller-1 rabbitmq-cluster(rabbitmq)[395249]: DEBUG: rabbitmq monitor : 0

So while we should make sure the rabbitmq_user piece of code only gets triggered a) on the bootstrap node and b) potentially on BM only, the real question is: why does "rabbitmqctl list_users" not return success if the bundle is up on the node?

Michele Baldessari (michele) wrote :

So the reason for the failure on ctrl-1 and ctrl-2 is that rabbit refuses the rabbitmqctl calls:
=ERROR REPORT==== 3-Mar-2018::05:37:39 ===** Connection attempt from disallowed node 'rabbitmq-cli-67@controller-1' **

The reason for the refusal is that there is no proper cookie set on these controllers.
The reason for the cookie not being set on controller-1 and controller-2 is that the Exec['rabbitmq-ready'] -> Rabbitmq_user<||> collector is being set only on the bootstrap node.
The reason for it being set only there is that we do not want to enforce the rabbitmq-ready on all nodes (think controller replacement)

So the solution is to do what we did for the veritas user, which is to create it only on the bootstrap node.
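
A minimal sketch of that idea, not the actual puppet-tripleo change: the class name `example::rabbitmq_user`, its parameters, and the stand-in `rabbitmq-ready` exec are all hypothetical and only illustrate guarding both the user resource and the ordering collector behind a bootstrap-node check. It assumes the `rabbitmq_user` type from puppet-rabbitmq and `downcase` from stdlib are available.

```puppet
# Hypothetical sketch: manage the rabbitmq user only on the bootstrap node.
class example::rabbitmq_user (
  String $bootstrap_node,   # assumed: short hostname of the bootstrap node
  String $rabbitmq_user,
  String $rabbitmq_pass,
) {
  if downcase($facts['networking']['hostname']) == downcase($bootstrap_node) {
    # Stand-in for the real 'rabbitmq-ready' exec, which is defined
    # elsewhere in puppet-tripleo; command and retry values are assumptions.
    exec { 'rabbitmq-ready':
      command   => '/usr/sbin/rabbitmqctl -q status',
      tries     => 30,
      try_sleep => 10,
    }

    # Declaring the user triggers 'rabbitmqctl -q list_users' through the
    # provider, so it only happens where rabbitmq is known to be reachable.
    rabbitmq_user { $rabbitmq_user:
      password => $rabbitmq_pass,
      admin    => true,
    }

    # Wait for rabbitmq before touching any user resource.
    Exec['rabbitmq-ready'] -> Rabbitmq_user<||>
  }
  # On non-bootstrap nodes nothing is declared, so no rabbitmqctl call is
  # made; the cluster replicates the user from the bootstrap node.
}
```

With this shape, controller-1 and controller-2 never run rabbitmqctl during the init step, so the missing cookie on those nodes no longer causes a timeout.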

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/549787
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=2abe91fe867fa18b6181324a083f4a33b60066e8
Submitter: Zuul
Branch: master

commit 2abe91fe867fa18b6181324a083f4a33b60066e8
Author: Michele Baldessari <email address hidden>
Date: Mon Mar 5 16:17:11 2018 +0100

    Fix stack update with rabbitmq containers

    In change I44865af3d5eb2d37eb648ac7227277e86c8fbc54 we
    add support to change rabbitmq password on update.
    This breaks when using containers in a number of scenarios:
    - FFU because at this stage rabbitmq can be down on the node
      and the call to rabbitmq_user will trigger a rabbitmqctl list_user
      call which will eventually time out.
    - Controller replacement procedure because on the newly replaced
      controller rabbitmq will not be up yet and the rabbitmq_user call
      will timeout just like during FFU.
    - Upgrades from Pike to Queens upgrades we seem to be hitting the
      same issue as FFU

    The exact error that we will get on the non bootstrap nodes is the
    following:
    =ERROR REPORT==== 3-Mar-2018::05:37:39 ===** Connection attempt from
    disallowed node 'rabbitmq-cli-67@controller-1' **

    The reason for this is that on non bootstrap node we do not enforce
    the Exec['rabbitmq-ready'] -> Rabbitmq_user<||> collector, because
    we do not want to enforce it there (think controller replacement)

    Let's make sure we enforce the Rabbitmq_user class only on bootstrap
    nodes, since in HA deployments the users get replicated by the cluster anyway.

    Co-Authored-By: Damien Ciabrini <email address hidden>
    Co-Authored-By: John Eckersberg <email address hidden>
    Closes-Bug: #1753949

    Change-Id: I483fe61f09fa2c3034d2b3d8ffa1ca53feefe6af

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/550741

OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/550743

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/pike)

Reviewed: https://review.openstack.org/550743
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=6e4aeda8c487dbb38937cb06bd170c3212821c9d
Submitter: Zuul
Branch: stable/pike

commit 6e4aeda8c487dbb38937cb06bd170c3212821c9d
Author: Michele Baldessari <email address hidden>
Date: Mon Mar 5 16:17:11 2018 +0100

    Fix stack update with rabbitmq containers

    In change I44865af3d5eb2d37eb648ac7227277e86c8fbc54 we
    add support to change rabbitmq password on update.
    This breaks when using containers in a number of scenarios:
    - FFU because at this stage rabbitmq can be down on the node
      and the call to rabbitmq_user will trigger a rabbitmqctl list_user
      call which will eventually time out.
    - Controller replacement procedure because on the newly replaced
      controller rabbitmq will not be up yet and the rabbitmq_user call
      will timeout just like during FFU.
    - Upgrades from Pike to Queens upgrades we seem to be hitting the
      same issue as FFU

    The exact error that we will get on the non bootstrap nodes is the
    following:
    =ERROR REPORT==== 3-Mar-2018::05:37:39 ===** Connection attempt from
    disallowed node 'rabbitmq-cli-67@controller-1' **

    The reason for this is that on non bootstrap node we do not enforce
    the Exec['rabbitmq-ready'] -> Rabbitmq_user<||> collector, because
    we do not want to enforce it there (think controller replacement)

    Let's make sure we enforce the Rabbitmq_user class only on bootstrap
    nodes, since in HA deployments the users get replicated by the cluster anyway.

    Co-Authored-By: Damien Ciabrini <email address hidden>
    Co-Authored-By: John Eckersberg <email address hidden>
    Closes-Bug: #1753949

    Change-Id: I483fe61f09fa2c3034d2b3d8ffa1ca53feefe6af
    (cherry picked from commit 2abe91fe867fa18b6181324a083f4a33b60066e8)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.openstack.org/550741
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=9454f3823b47c7500fe66bc20521c2ed076da7db
Submitter: Zuul
Branch: stable/queens

commit 9454f3823b47c7500fe66bc20521c2ed076da7db
Author: Michele Baldessari <email address hidden>
Date: Mon Mar 5 16:17:11 2018 +0100

    Fix stack update with rabbitmq containers

    In change I44865af3d5eb2d37eb648ac7227277e86c8fbc54 we
    add support to change rabbitmq password on update.
    This breaks when using containers in a number of scenarios:
    - FFU because at this stage rabbitmq can be down on the node
      and the call to rabbitmq_user will trigger a rabbitmqctl list_user
      call which will eventually time out.
    - Controller replacement procedure because on the newly replaced
      controller rabbitmq will not be up yet and the rabbitmq_user call
      will timeout just like during FFU.
    - Upgrades from Pike to Queens upgrades we seem to be hitting the
      same issue as FFU

    The exact error that we will get on the non bootstrap nodes is the
    following:
    =ERROR REPORT==== 3-Mar-2018::05:37:39 ===** Connection attempt from
    disallowed node 'rabbitmq-cli-67@controller-1' **

    The reason for this is that on non bootstrap node we do not enforce
    the Exec['rabbitmq-ready'] -> Rabbitmq_user<||> collector, because
    we do not want to enforce it there (think controller replacement)

    Let's make sure we enforce the Rabbitmq_user class only on bootstrap
    nodes, since in HA deployments the users get replicated by the cluster anyway.

    Co-Authored-By: Damien Ciabrini <email address hidden>
    Co-Authored-By: John Eckersberg <email address hidden>
    Closes-Bug: #1753949

    Change-Id: I483fe61f09fa2c3034d2b3d8ffa1ca53feefe6af
    (cherry picked from commit 2abe91fe867fa18b6181324a083f4a33b60066e8)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 7.4.11

This issue was fixed in the openstack/puppet-tripleo 7.4.11 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.3.1

This issue was fixed in the openstack/puppet-tripleo 8.3.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.0.0

This issue was fixed in the openstack/puppet-tripleo 9.0.0 release.
