[CI][Queens] undercloud-upgrade task fails due to rabbitmq

Bug #1822120 reported by Cédric Jeanneret
This bug affects 2 people
Affects   Status         Importance   Assigned to        Milestone
tripleo   Fix Released   Critical     Cédric Jeanneret

Bug Description

Detected today:

2019-03-28 13:30:28 | 2019-03-28 13:30:28,046 INFO: Error: /Stage[main]/Rabbitmq/Rabbitmq_plugin[rabbitmq_management]: Could not evaluate: Command is still failing after 180 seconds expired!

http://logs.openstack.org/31/648331/2/check/tripleo-ci-centos-7-undercloud-upgrades/95eb12e/logs/undercloud/home/zuul/undercloud_upgrade.log.txt.gz#_2019-03-28_13_30_28

More info:
http://logs.openstack.org/31/648331/2/check/tripleo-ci-centos-7-undercloud<email address hidden>
ERROR: "Free disk space monitor encountered an error (e.g. failed to parse output from OS tools): ~p, retries left: ~s~n" - [{{'EXIT',
                                                                                                                               {unparseable,
                                                                                                                                []}},
                                                                                                                              8170045440},
                                                                                                                             5]

=INFO REPORT==== 28-Mar-2019::13:37:46 ===
ERROR: "Free disk space monitor encountered an error (e.g. failed to parse output from OS tools): ~p, retries left: ~s~n" - [{{'EXIT',
                                                                                                                               {unparseable,
                                                                                                                                []}},
                                                                                                                              8170045440},
                                                                                                                             4]

wes hayutin (weshayutin)
Changed in tripleo:
importance: Undecided → Critical
Alfredo Moralejo (amoralej) wrote :

Is this only happening in Queens jobs? Are upgrade jobs working in newer releases?

Alfredo Moralejo (amoralej) wrote :

Apparently the same operation worked fine in the rabbitmq update gate job [1]:

2019-03-26 05:50:13 | 2019-03-26 05:50:13,670 INFO: Notice: /Stage[main]/Rabbitmq/Rabbitmq_plugin[rabbitmq_management]/ensure: created

using the same version of rabbitmq server [2].

[1] http://logs.rdoproject.org/34/15234/7/check/rdoinfo-tripleo-queens-testing-centos-7-multinode-1ctlr-featureset016/1f57b04/logs/undercloud/home/zuul/undercloud_install.log.txt.gz

[2] http://logs.rdoproject.org/34/15234/7/check/rdoinfo-tripleo-queens-testing-centos-7-multinode-1ctlr<email address hidden>

Cédric Jeanneret (cjeanner) wrote :

Some updates here as well:
The issue is caused by the upgrade of an Erlang package: a socket stays open, linked to the old, deleted binary, which prevents the new one from starting properly.

This looks like a packaging issue; the relevant teams are currently investigating further.
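
To illustrate the "old, deleted binary" symptom: on Linux, a process whose executable was replaced on disk shows a `/proc/<pid>/exe` link ending in " (deleted)". The sketch below scans for such processes. It is only a diagnostic aid under that assumption, not something taken from the investigation itself; the function names are hypothetical.

```python
import os
from typing import Dict

# Suffix the Linux kernel appends to /proc/<pid>/exe when the binary was unlinked.
DELETED_SUFFIX = " (deleted)"


def is_stale_exe(link_target: str) -> bool:
    """Return True if a /proc/<pid>/exe target points at an unlinked binary."""
    return link_target.endswith(DELETED_SUFFIX)


def find_stale_processes(proc_root: str = "/proc") -> Dict[int, str]:
    """Map PID -> original binary path for processes whose executable was deleted."""
    stale = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # Not a PID directory.
        try:
            target = os.readlink(os.path.join(proc_root, entry, "exe"))
        except OSError:
            # Kernel threads, exited processes, or insufficient permissions.
            continue
        if is_stale_exe(target):
            stale[int(entry)] = target[: -len(DELETED_SUFFIX)]
    return stale


if __name__ == "__main__":
    for pid, path in sorted(find_stale_processes().items()):
        print(pid, path)
```

Run as root after the package upgrade, this would list any process (for example an Erlang VM or epmd) still executing a binary that no longer exists on disk.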

Damien Ciabrini (dciabrin) wrote :

Quick update:

The upgrade job [1] seems to fail due to a puppet resource failing to access rabbitmq plugins.

While running the Pike-to-Queens (P->Q) upgrade, the erlang-erts package is upgraded live on the undercloud.

2019-03-29 13:55:30 | 2019-03-29 13:55:30,005 INFO: Error: /Stage[main]/Rabbitmq/Rabbitmq_plugin[rabbitmq_management]: Could not evaluate: Command is still failing after 180 seconds expired!
2019-03-29 13:55:30 | 2019-03-29 13:55:30,007 INFO: Notice: /Stage[main]/Rabbitmq::Service/Service[rabbitmq-server]: Dependency Rabbitmq_plugin[rabbitmq_management] has failures: true

When deploying the original Pike env (which installs erlang-erts-18.3.4.5-4.el7.x86_64), plugins can be queried at the command line:

$ rabbitmq-plugins list -E -m
rabbitmq_management

But once the major upgrade has run (and installed erlang-erts-19.3.6.4-1.el7.x86_64) and failed in puppet, I can no longer query the state of rabbitmq plugins, even if rabbitmq is still running:

$ rabbitmq-plugins list -E -m
Error: invalid parameter: []
Usage:
rabbitmq-plugins [-n <node>] <command> [<command options>]

Commands:
list [-v] [-m] [-E] [-e] [<pattern>]
enable [--offline] [--online] <plugin> ...
disable [--offline] [--online] <plugin> ...
set [--offline] [--online] <plugin> ...

[1] http://logs.openstack.org/18/645118/1/check/tripleo-ci-centos-7-undercloud-upgrades/5a80d14/logs/undercloud/home/zuul/undercloud_upgrade.log.txt.gz
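
The check above can be scripted. This is a minimal sketch that runs the same `rabbitmq-plugins list -E -m` command and treats any "Error:" line as a failed query, matching the two outputs quoted in this comment. The function and exception names are hypothetical, and merging stderr into stdout is an assumption, since the report does not say which stream carries the error.

```python
import subprocess
from typing import List


class PluginQueryError(RuntimeError):
    """Raised when rabbitmq-plugins output is an error instead of a plugin list."""


def parse_enabled_plugins(output: str) -> List[str]:
    """Return the plugin names from minimal (-m) output, or raise on an error line."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    for line in lines:
        if line.startswith("Error:"):
            raise PluginQueryError(line)
    return lines


def enabled_plugins() -> List[str]:
    """Query the explicitly enabled plugins, as done by hand in the comment above."""
    result = subprocess.run(
        ["rabbitmq-plugins", "list", "-E", "-m"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # Assumption: capture both streams together.
        universal_newlines=True,
    )
    return parse_enabled_plugins(result.stdout)
```

On the healthy Pike environment `enabled_plugins()` would return `["rabbitmq_management"]`; after the failed upgrade it would raise `PluginQueryError` carrying the "invalid parameter" line.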

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/649194

OpenStack Infra (hudson-openstack) wrote : Fix proposed to instack-undercloud (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/649287

Marios Andreou (marios-b) wrote :

even though not in itself a promotion blocker, this is now blocking the fix for a promotion blocker :/

https://review.openstack.org/#/c/649084/ (fix for https://bugs.launchpad.net/tripleo/+bug/1822080 ) is blocked on this

OpenStack Infra (hudson-openstack) wrote : Fix merged to instack-undercloud (stable/queens)

Reviewed: https://review.openstack.org/649287
Committed: https://git.openstack.org/cgit/openstack/instack-undercloud/commit/?id=bbe2840e09468a5b3dc3ce3fbcdda7bf88818713
Submitter: Zuul
Branch: stable/queens

commit bbe2840e09468a5b3dc3ce3fbcdda7bf88818713
Author: Damien Ciabrini <email address hidden>
Date: Tue Apr 2 13:48:08 2019 +0200

    [Queens-only] Restart Erlang VM on undercloud upgrade

    When running an undercloud upgrade, yum can upgrade to a newer
    major version of Erlang, and it doesn't restart the running
    Erlang Port Mapper Daemon (epmd) nor RabbitMQ.

    When upgrading from Erlang 18 to Erlang 19, the Erlang VM
    implements calls to OS commands differently [1]. This confuses
    RabbitMQ's recurring monitoring process:

        ERROR: "Free disk space monitor encountered an error (e.g. failed to parse output from OS tools):

    and also makes command line tools like management plugin fail
    until epmd is restarted:

        $ rabbitmq-plugins list -E -m
        Error: invalid parameter: []

    Make sure to restart both epmd and RabbitMQ if they are running,
    to start the upgrade from a valid erlang runtime.

    We only do that for Queens because starting with Rocky, RabbitMQ
    is containerized and thus always restarts both RabbitMQ and
    the Erlang VM on container image updates.

    [1] https://github.com/erlang/otp/commit/200247f972b012ced0c4b2c6611f091af66ebedd

    Change-Id: I6f486b0b70f19d8b4916ef500675c0739939e060
    Closes-Bug: #1822120
    Co-Authored-By: Peter Lemenkov <email address hidden>
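
For readers who want to reproduce the workaround by hand, here is a rough sketch of the ordering the commit message describes: stop RabbitMQ if it is running, kill epmd if it is running, then start RabbitMQ again so that both come back on the new Erlang runtime. This is not the code from the merged change. The `rabbitmq-server` unit name, the use of `systemctl`, `pgrep` and `epmd -kill`, and all function names are assumptions; in particular, how epmd is managed on a given host is not stated in this report.

```python
import subprocess
from typing import List

Command = List[str]

# Assumed probes: exit status 0 means "running".
RABBITMQ_CHECK = ["systemctl", "is-active", "--quiet", "rabbitmq-server"]
EPMD_CHECK = ["pgrep", "-x", "epmd"]


def is_running(check: Command) -> bool:
    """Return True if the probe command exits with status 0."""
    try:
        return subprocess.call(
            check, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        ) == 0
    except OSError:
        return False  # Probe tool not available on this host.


def restart_plan(rabbitmq_running: bool, epmd_running: bool) -> List[Command]:
    """Build the ordered commands; only touch what is actually running."""
    plan = []
    if rabbitmq_running:
        plan.append(["systemctl", "stop", "rabbitmq-server"])
    if epmd_running:
        # epmd is killed after RabbitMQ stops, so no node is still registered.
        plan.append(["epmd", "-kill"])
    if rabbitmq_running:
        plan.append(["systemctl", "start", "rabbitmq-server"])
    return plan


if __name__ == "__main__":
    for command in restart_plan(is_running(RABBITMQ_CHECK), is_running(EPMD_CHECK)):
        subprocess.check_call(command)
```

Keeping the plan as a pure function makes the ordering easy to review before anything is executed on the undercloud.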

tags: added: in-stable-queens
Changed in tripleo:
status: Triaged → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/queens)

Change abandoned by Damien Ciabrini (<email address hidden>) on branch: stable/queens
Review: https://review.openstack.org/649194
Reason: Superseded by https://review.openstack.org/649287

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/instack-undercloud 8.4.8

This issue was fixed in the openstack/instack-undercloud 8.4.8 release.
