CentOS 8 train minor update failing: ERROR configuring rabbit_init_tasks

Bug #1887176 reported by Marios Andreou
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Jose Luis Franco

Bug Description

The CentOS 8 train minor update job is failing at [1][2] during the controller update run with a trace like:

        * 2020-07-10 11:25:18.209289 | primary | FAILED - RETRYING: was the overcloud minor update successful? (25010 retries left).
2020-07-10 11:25:23.703865 | primary | failed: [undercloud] (item={'started': 1, 'finished': 0, 'ansible_job_id': '159939992880.181202', 'results_file': '/home/zuul/.ansible_async/159939992880.181202', 'changed': True, 'failed': False, 'oc_running_role': 'Controller', 'ansible_loop_var': 'oc_running_role'}) => {
        * 2020-07-10 11:25:23.704991 | primary | "cmd": "set -o pipefail\nbash /home/zuul/overcloud_update_run-Controller.sh 2>&1 | awk '{ print strftime(\"%Y-%m-%d %H:%M:%S |\"), $0; fflush(); }' > /home/zuul/overcloud_update_run_Controller.log\n",

Digging a little deeper into the package_update.log [3] referenced in the update_run_Controller.log (from the trace above), we see a lot of these:

        * 2020-07-10 11:11:42,689 ERROR: 75327 -- Error executing ['podman', 'container', 'exists', 'container-puppet-rabbitmq']: returned 1",
        * 2020-07-10 11:11:42,707 ERROR: 75325 -- Error executing ['podman', 'container', 'exists', 'container-puppet-keystone']: returned 1",

and finally

        * 2020-07-10 11:28:50,158 ERROR: 89033 -- ERROR configuring rabbit_init_tasks",

I can't see the problem in rabbit_init_tasks.log [4], though; it seems to complete?
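For reference, the error lines above can be pulled out of a locally downloaded copy of the package_update.log from [3] with something like the following (the local filename is an assumption):

# Rough sketch only: surface the quoted error lines from a local copy of
# the package_update.log referenced in [3]; the local filename is assumed.
grep -E 'Error executing|ERROR configuring' package_update.log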

Note that this job is new and is being added with [5].

[1] https://d77580f8adad064c9e13-1cbf7c2bad1b53c710605a1cfc31790e.ssl.cf5.rackcdn.com/739459/7/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates/d51f37d/job-output.txt
[2] https://12df9c834cfda7bed796-e5788b19bd89c28fd446f909c1b0ae99.ssl.cf5.rackcdn.com/739217/6/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train/916ad5a/job-output.txt
[3] https://12df9c834cfda7bed796-e5788b19bd89c28fd446f909c1b0ae99.ssl.cf5.rackcdn.com/739217/6/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train/916ad5a/logs/undercloud/var/log/containers/mistral/package_update.log
[4] https://12df9c834cfda7bed796-e5788b19bd89c28fd446f909c1b0ae99.ssl.cf5.rackcdn.com/739217/6/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train/916ad5a/logs/undercloud/var/log/containers/stdouts/container-puppet-rabbit_init_tasks.log
[5] https://review.opendev.org/#/c/739217/

Revision history for this message
Marios Andreou (marios-b) wrote :

I got a green run of this job today at https://review.opendev.org/#/c/739459/ using matbu's patches https://review.opendev.org/709134 & https://review.opendev.org/709066 which avoid mistral.

I will work with the upgrades team to see if we can merge those.

Revision history for this message
Marios Andreou (marios-b) wrote :

Discussed this a bit today with chem and matbu from upgrades.
Long story short, it is preferable to have the train CentOS 8 job running *with* the mistral workflow, since that is what is currently the default in train/OSP16.

Mathieu's patches at https://review.opendev.org/709134 & https://review.opendev.org/709066 will make it so we *can* run the minor update without mistral; however, it will still just be an option.

So, conclusions (will ask DFG:Upgrades to confirm):

1. For *now* we should be exercising the mistral workflow for train, I think.

2. If and when the patches at /709134 and /709066 merge, we can consider also adding another job that tests --no-workflow.

3. If and when --no-workflow becomes the default, we can remove the job that exercises mistral.

Train is a special case: ussuri and master already run without mistral by default.

Revision history for this message
Sofer Athlan-Guyot (sofer-athlan-guyot) wrote :

Totally agree with Marios' previous comment.

Revision history for this message
Marios Andreou (marios-b) wrote :

So we are back to digging, and we need this to work with mistral. One big question from me is why this rabbit issue only happens with mistral; per comment #1 above, it runs green with --no-workflow.

The issue happens during the Controller update and seems to be with rabbit. I am pasting some of the more interesting logs I saw today, hoping they will mean more to the pidone folks I point at this comment:

At [1]:

        =INFO REPORT==== 15-Jul-2020::12:09:21 ===
        Error description:
           {badmatch,{aborted,{"Bad transform function",rabbit_durable_exchange,
                               #Fun<rabbit_upgrade_functions.4.51494978>,
                               'rabbit@centos-8-ovh-gra1-0018220965',
                               function_clause}}}

At [2]:

        2020-07-15T12:30:45.322368473+00:00 stdout F BOOT FAILED
        2020-07-15T12:30:45.322368473+00:00 stdout F "Found lock file at ~s.\n Either previous upgrade is in progress or has failed.\n Database backup path: ~s"

[1] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_bc8/739217/8/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train/bc8dfe5/logs/subnode-1/var/log/containers/rabbitmq/rabbit%40centos-8-ovh-gra1-0018220965.log
[2] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_bc8/739217/8/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train/bc8dfe5/logs/subnode-1/var/log/containers/stdouts/rabbitmq.log

Revision history for this message
Michele Baldessari (michele) wrote :

So this is the starting version (I think):
sudo podman run -it -uroot --net=host --rm docker.io/tripleotraincentos8/centos-binary-rabbitmq:96dc7dca942efb4750860a30e60105cc sh -c 'rpm -q rabbitmq-server erlang-erts'
rabbitmq-server-3.8.3-1.el8.x86_64
erlang-erts-22.3.4.1-1.el8.x86_64

1) rabbitmq starts correctly at:
2020-07-15T11:49:15.031165119+00:00 stdout P Starting broker...
2020-07-15T11:49:16.373581044+00:00 stdout F completed with 3 plugins.

2) then it gets stopped successfully at 12:09:12
2020-07-15T12:09:12.996471636+00:00 stdout F Stopping and halting node rabbit@centos-8-ovh-gra1-0018220965 ...
2020-07-15T12:09:13.185513351+00:00 stderr F Gracefully halting Erlang VM

3) Then it tries to start at 12:09:15, at which point kolla first does a truckload of chowns/chmods:

2020-07-15T12:09:15.904552550+00:00 stderr F INFO:__main__:Setting permission for /var/lib/rabbitmq
2020-07-15T12:09:15.904699018+00:00 stderr F INFO:__main__:Setting permission for /var/lib/rabbitmq/mnesia
2020-07-15T12:09:15.904781483+00:00 stderr F INFO:__main__:Setting permission for /var/lib/rabbitmq/.erlang.cookie
2020-07-15T12:09:15.904954853+00:00 stderr F INFO:__main__:Setting permission for /var/lib/rabbitmq/mnesia/rabbit@centos-8-ovh-gra1-0018220965
2020-07-15T12:09:15.905014819+00:00 stderr F INFO:__main__:Setting permission for /var/lib/rabbitmq/mnesia/rabbit@centos-8-ovh-gra1-0018220965-plugins-expand

4) After that it fails with:
2020-07-15T12:09:15.972247410+00:00 stderr F ++ chmod 755 /var/log/kolla/rabbitmq
2020-07-15T12:09:15.973874699+00:00 stdout F Running command: '/usr/lib/rabbitmq/bin/rabbitmq-server'
2020-07-15T12:09:15.973901567+00:00 stderr F + echo 'Running command: '\''/usr/lib/rabbitmq/bin/rabbitmq-server'\'''
2020-07-15T12:09:15.973901567+00:00 stderr F + exec /usr/lib/rabbitmq/bin/rabbitmq-server
2020-07-15T12:09:21.261250425+00:00 stdout F Mnesia('rabbit@centos-8-ovh-gra1-0018220965'): Transform function failed: 'function_clause' in '[{rabbit_upgrade_functions,
2020-07-15T12:09:21.261250425+00:00 stdout F '-internal_exchanges/0-fun-0-',
2020-07-15T12:09:21.261250425+00:00 stdout F [{exchange,
2020-07-15T12:09:21.261250425+00:00 stdout F {resource,

2020-07-15T12:09:21.262545777+00:00 stdout F BOOT FAILED
2020-07-15T12:09:21.262545777+00:00 stdout F ===========
2020-07-15T12:09:21.262545777+00:00 stdout F
2020-07-15T12:09:21.262545777+00:00 stdout F Error description:
2020-07-15T12:09:21.262545777+00:00 stdout F {badmatch,{aborted,{"Bad transform function",rabbit_durable_exchange,
2020-07-15T12:09:21.262545777+00:00 stdout F #Fun<rabbit_upgrade_functions.4.51494978>,
2020-07-15T12:09:21.262545777+00:00 stdout F 'rabbit@centos-8-ovh-gra1-0018220965',
2020-07-15T12:09:21.262545777+00:00 stdout F function_clause}}}
2020-07-15T12:09:21.262545777+00:00 stdout F
2020-07-15T1...


Revision history for this message
Michele Baldessari (michele) wrote :

Also, why on earth did we move to rabbitmq 3.8.x on train? Now we're diverging considerably from downstream.

Revision history for this message
Marios Andreou (marios-b) wrote :

WRT why rabbitmq 3.8.x on train: it looks like we get that from delorean-deps [1], which points us to the centos8 messaging repos [2], and right now, AFAICS, that is in fact the only version available there (a quick check of what the repo offers is sketched after the links below).

So I think we can say it's quite likely we're not actually updating rabbit, but the update tasks are being executed and they may be trying to update _something_ or run some migrations... I haven't dug in there yet; I'll see if I can understand which update task(s) might be causing this.

[1] https://trunk.rdoproject.org/centos8-train/delorean-deps.repo
[2] http://mirror.centos.org/centos/8/messaging/x86_64/
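To sanity-check that, something like the following lists which rabbitmq-server builds dnf can actually resolve from delorean-deps on CentOS 8. This is a rough sketch, not part of the CI job: only the repo URL comes from [1]; the throwaway container image and the exact dnf invocation are assumptions.

# Rough sketch: fetch the train delorean-deps repo file in a throwaway
# CentOS 8 container and list the rabbitmq-server builds dnf can see.
sudo podman run --rm -it centos:8 bash -c '
  curl -sfo /etc/yum.repos.d/delorean-deps.repo \
    https://trunk.rdoproject.org/centos8-train/delorean-deps.repo
  dnf -q list available rabbitmq-server
'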

Revision history for this message
Marios Andreou (marios-b) wrote :

In fact, as far as I can see in [1], it doesn't even get as far as the update tasks. It looks like the failure is coming from [2]:

        * 2020-07-15 12:10:19 | TASK [Run container-puppet tasks (bootstrap tasks) for step 2 with paunch] *****
        2020-07-15 12:10:19 | task path: /var/lib/mistral/057a369e-e9bb-4689-8ab1-7465ab97b0de/host-container-puppet-tasks.yaml:17
    ...
        * 2020-07-15 12:10:19 | TASK [Wait for container-puppet tasks (bootstrap tasks) for step 2 to finish] ***
        2020-07-15 12:10:19 | task path: /var/lib/mistral/057a369e-e9bb-4689-8ab1-7465ab97b0de/host-container-puppet-tasks.yaml:37

The step 2 tasks [3] don't really tell us much (we already know rabbit_init_tasks is failing).

One big unanswered thing for me is still: why do we see this bug with mistral only? It seems strange that a rabbit/mnesia upgrade is happening and failing here, but only with mistral.

[1] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_bc8/739217/8/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train/bc8dfe5/logs/undercloud/home/zuul/overcloud_update_run_Controller.log
[2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/common/host-container-puppet-tasks.yaml
[3] https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_bc8/739217/8/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train/bc8dfe5/logs/undercloud/var/lib/container-puppet/container-puppet-tasks2.json

Revision history for this message
Marios Andreou (marios-b) wrote :

Also seen in the periodic job: https://bugs.launchpad.net/tripleo/+bug/1888669 (marked it as a duplicate of this one to keep the discussion in one place).

wes hayutin (weshayutin)
tags: added: alert
Changed in tripleo:
milestone: victoria-1 → victoria-3
Revision history for this message
Marios Andreou (marios-b) wrote :

Adding some fresh logs, since the original ones in the description are a couple of weeks old already :)

From the test at [1] yesterday we have the same trace described in this bug; see [2]:

        * 2020-07-28 19:39:04 | 2020-07-28 19:39:04.799 170863 ERROR openstack RuntimeError: Update failed with: Ansible failed, check log at /var/log/containers/mistral/package_update.log.

and then [3]

        * "2020-07-28 19:39:03,321 ERROR: 80687 -- ERROR configuring rabbit_init_tasks",

[1] https://review.opendev.org/#/c/739459/
[2] https://5173a9504b4112de0add-33dcd8a3e46b9742c8c437a07c900baf.ssl.cf1.rackcdn.com/739459/18/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train/8a8b0eb/logs/undercloud/home/zuul/overcloud_update_run_Controller.log
[3] https://5173a9504b4112de0add-33dcd8a3e46b9742c8c437a07c900baf.ssl.cf1.rackcdn.com/739459/18/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train/8a8b0eb/logs/undercloud/var/log/containers/mistral/package_update.log

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

I think if mnesia sometimes cannot be upgraded, let's just create a pre-upgrade task to erase it (there is nothing in it special enough to preserve for careful upgrades).
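For illustration only, the cleanup being suggested would boil down to something like the sketch below. This is not an implemented TripleO task: the container name and the assumption that the broker data is bind-mounted from the host at /var/lib/rabbitmq would both need to be verified on a real controller.

# Illustration of the suggested pre-update cleanup, not an implemented task.
# Assumes the broker container is named "rabbitmq" and that its data lives
# on the host under /var/lib/rabbitmq; verify both before trying this.
sudo podman stop rabbitmq
sudo rm -rf /var/lib/rabbitmq/mnesia   # drop the schema the old broker cannot transform
sudo podman start rabbitmq             # a fresh mnesia database is created on boot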

Revision history for this message
John Eckersberg (jeckersb) wrote :

It looks like the job is somehow going backwards with the rabbitmq version, which is not going to work and produces these errors.

From the log https://5173a9504b4112de0add-33dcd8a3e46b9742c8c437a07c900baf.ssl.cf1.rackcdn.com/739459/18/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train/8a8b0eb/logs/subnode-1/var/log/containers/rabbitmq/rabbit%40centos-8-inap-mtl01-0018911735.log

I see:

2020-07-28 19:09:36.742 [info] <0.280.0>
 Starting RabbitMQ 3.8.3 on Erlang 22.3.4.1

So we've got 3.8.3 running.

Then later when it fails I see the stack trace:

Stack trace:
   [{rabbit_upgrade_functions,transform,3,
                              [{file,"src/rabbit_upgrade_functions.erl"},
                               {line,481}]},

However that stack does not match 3.8.3; line 481 in rabbit_upgrade_functions.erl is not in the transform() function on version 3.8.3.

But the train container image (tripleotrain/centos-binary-rabbitmq:current-tripleo) has version 3.6.16, which *does* match that stack trace.

So we've got 3.8.3 running, then we stop it and try to boot 3.6.16. But it can't boot because it doesn't understand how to adjust the schema that was created by the newer version.
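A quick way to see the mismatch directly, reusing the rpm -q check from Michele's earlier comment, is to ask both train namespaces what they ship. This is a sketch; the image references are the ones quoted in this bug.

# Sketch: compare rabbitmq-server in the centos8-based image the node was
# deployed with and the centos7-based image the update pulled in.
for img in \
  docker.io/tripleotraincentos8/centos-binary-rabbitmq:96dc7dca942efb4750860a30e60105cc \
  docker.io/tripleotrain/centos-binary-rabbitmq:current-tripleo; do
    echo "== ${img}"
    sudo podman run -it -uroot --net=host --rm "${img}" \
        sh -c 'rpm -q rabbitmq-server erlang-erts'
done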

Revision history for this message
Damien Ciabrini (dciabrin) wrote :

Looking in more detail at the containers available on the overcloud node [1], it looks like the
controller node, which was originally deployed with a centos8-based image [2], is actually being updated to a centos7-based image! [3]

This explains why we went from rabbitmq-server-3.8.3-1.el8.x86_64 down to rabbitmq-server-3.6.16-1.el7.noarch.

The job should be tweaked to forbid that, otherwise it will always fail.

[1] http://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_bc8/739217/8/check/tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train/bc8dfe5/logs/subnode-1/var/log/extra/podman/podman_allinfo.log
[2] docker.io/tripleotraincentos8/centos-binary-rabbitmq:96dc7dca942efb4750860a30e60105cc
[3] docker.io/tripleotrain/centos-binary-rabbitmq:current-tripleo

Revision history for this message
Marios Andreou (marios-b) wrote :

dciabrin & eggs, thank you very much for the pointers. @damien, it must be something in our tooling, possibly tripleo-upgrade, because train is a special case that has both the tripleotrain and tripleotraincentos8 namespaces, and something is defaulting to the 'usual' pattern of 'tripleo{{release}}' and failing.

I will dig into this more and find a fix. Thank you very much for your help in understanding what's going on.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/745873

Revision history for this message
Marios Andreou (marios-b) wrote :

OK... so I believe the error is coming from [1] where we have:

           namespace: docker.io/tripleotrain

On deployment in this job we are setting the correct namespace with [2], but then when it prepares for the minor update (and re-runs container image prepare) it uses the defaults from [1].

I just posted https://review.opendev.org/745873 for testing, but I don't know if it will be acceptable.

One alternative I can think of is to have a duplicate parameter defaults file (i.e. duplicate [1]) and then use a conditional based on OS (CentOS 7 vs 8) in [3] to choose the right one. A sketch of explicitly pinning the namespace instead follows the links below.

[1] https://github.com/openstack/tripleo-common/blob/473d9cbc32163ce548e1483e0d5e1631c096c8d2/container-images/container_image_prepare_defaults.yaml#L16
[2] https://opendev.org/openstack/tripleo-quickstart/src/commit/7bf80f6b10fbb3b2704b1d70f993439ca4916861/config/release/tripleo-ci/CentOS-8/train.yml#L9
[3] https://github.com/openstack/tripleo-common/blob/c622351745d151e85d003eee84b33439061b8326/tripleo_common/image/kolla_builder.py#L53
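Until a proper fix lands, the failure in this job could presumably also be avoided by pinning the namespace explicitly when the update is prepared, so the defaults from [1] never kick in. The sketch below is illustrative only: the file name and the exact update-prepare invocation are assumptions, while the ContainerImagePrepare values are the ones referenced in this bug.

# Rough sketch of explicitly pinning the centos8 train namespace so that the
# container image prepare re-run during update prepare does not fall back to
# docker.io/tripleotrain.
cat > ~/containers-prepare-parameter.yaml <<'EOF'
parameter_defaults:
  ContainerImagePrepare:
  - set:
      namespace: docker.io/tripleotraincentos8
      name_prefix: centos-binary-
      tag: current-tripleo
EOF

# plus whatever environment files the original deploy used
openstack overcloud update prepare \
  --templates \
  -e ~/containers-prepare-parameter.yaml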

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/745924

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/train)

Change abandoned by Marios Andreou (<email address hidden>) on branch: stable/train
Review: https://review.opendev.org/745873
Reason: https://review.opendev.org/#/c/745924/

Revision history for this message
Marios Andreou (marios-b) wrote :

The patch from Alex in comment #17 is fighting its way through the gates, but in the meantime I added it as a Depends-On at the test review https://review.opendev.org/#/c/739459/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/train)

Reviewed: https://review.opendev.org/745924
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=7f1bf79e792275ff1bf3ad3b4f97951318f87e6f
Submitter: Zuul
Branch: stable/train

commit 7f1bf79e792275ff1bf3ad3b4f97951318f87e6f
Author: Alex Schultz <email address hidden>
Date: Wed Aug 12 09:38:07 2020 -0600

    [TRAIN-Only] Dynamically add centos8 namespace

    The container namespace is different for the python3/centos8 version of
    tripleo. We need to dynamically add it as the default if not explicitly
    specified via a ContainerImagePrepare setting. This change adds an
    update to the defaults to ensure that the centos8 namespace is changed
    in the defaults only when running under python3 and the default is
    docker.io/tripleotrain. This should prevent accidental changes when a
    user specifies a specific namespace or the defaults are changed.

    Change-Id: I3ff5dc84b15a81d6afcc70ab5dde2d327b3c4714
    Related-Bug: #1887176

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/748527

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/train)

Reviewed: https://review.opendev.org/748527
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=9fd82c33b06c8748ef8c304a3419db0aea77d4eb
Submitter: Zuul
Branch: stable/train

commit 9fd82c33b06c8748ef8c304a3419db0aea77d4eb
Author: Alex Schultz <email address hidden>
Date: Thu Aug 27 13:55:12 2020 -0600

    [TRAIN-Only] Default centos8 namespace fix

    We updated some of the logic as part of
    https://review.opendev.org/#/c/745924 but we missed the default var that
    is actually used by tripleoclient and when no ContainerImagePrepare is
    provided as part of the deployment. This fixes the default var loading
    logic to update the namespace to include centos8 as necessary.

    Change-Id: I64f82c24e0b3e0479d0396a96d5f49706300e849
    Related-Bug: #1887176

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-quickstart (master)

Fix proposed to branch: master
Review: https://review.opendev.org/748664

Changed in tripleo:
assignee: Marios Andreou (marios-b) → Jose Luis Franco (jfrancoa)
status: Triaged → In Progress
Revision history for this message
Marios Andreou (marios-b) wrote :

Closing this out: the job is pretty green now [1]. We merged [2] last night, making it voting.
As discussed at [3] we also need that patch (it works using defaults right now, but [3] will allow passing and using other values in the container prepare.yaml file). [3] is currently in the gate.

[1] https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates
[2] https://review.opendev.org/#/c/742420
[3] https://review.opendev.org/748664

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
Marios Andreou (marios-b) wrote :

Sorry, comment #24 pointed to the wrong job build (that was master); for train see:

https://zuul.openstack.org/builds?job_name=tripleo-ci-centos-8-scenario000-multinode-oooq-container-updates-train#

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-quickstart (master)

Reviewed: https://review.opendev.org/748664
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart/commit/?id=28c92c2e6dc35c2b09b66cd61bef8f46d096e2b5
Submitter: Zuul
Branch: master

commit 28c92c2e6dc35c2b09b66cd61bef8f46d096e2b5
Author: Jose Luis Franco Arza <email address hidden>
Date: Fri Aug 28 15:35:42 2020 +0200

    Leave container_registry_file to identify containers file in pre-rocky jobs.

    From rocky onwards, the file used to identify the container parameters
    passed changed to containers-prepare-parameter.yaml. This file had a
    different format than in previous releases. At that point, a new variable
    was introduced in tripleo-upgrade to identify this containers parameter
    file: uc_containers_prepare_file. To allow a transition from pre-rocky
    jobs to post-rocky, the container_registry_file parameter was kept to
    hold the name of the old containers prepare file. This file was removed
    from the appended overcloud_deploy.sh files to avoid issues during the
    upgrade/update prepare step [0].

    Updates/Upgrades jobs were still using container_registry_file as the
    parameter used to set up the containers, pointing at the post-rocky file
    containers-prepare-parameter.yaml. The code in [0] then removed it from
    the appended -e environment files, running the overcloud update prepare
    without the right containers-prepare-parameter.yaml.

    [0] - https://review.opendev.org/#/c/717698
    Change-Id: I6816db7382e7db6d0ae644c9487b2c5ee595a441
    Closes-Bug: #1887176

