centos8 standalone-upgrade ussuri job Failed container(s): ['haproxy_restart_bundle

Bug #1889395 reported by Marios Andreou
This bug affects 2 people
Affects: tripleo
Importance: Critical
Assigned to: Damien Ciabrini

Bug Description

The centos8 standalone-upgrade ussuri job fails at [1] during the post-upgrade deployment steps with a trace like:

        2020-07-28 21:57:16 | [ERROR]: Container(s) which finished with wrong return code:
        2020-07-28 21:57:16 | ['haproxy_restart_bundle']
        2020-07-28 21:57:16 | 2020-07-28 21:57:16.141685 | bc764e10-2a1a-a56a-2191-000000002cbf | FATAL | Check containers status | standalone | error={"changed": false, "msg": "Failed container(s): ['haproxy_restart_bundle'], check logs in /var/log/containers/stdouts/"}

This was seen in the test at [2] after fixing a different bug [3]. Looking at the haproxy stdouts [4], the restart fails because it actually can't find haproxy-bundle running anywhere:

        2020-07-28T21:57:09.678381400+00:00 stdout F Tue Jul 28 21:57:09 UTC 2020: Restarting haproxy-bundle globally
        2020-07-28T21:57:10.215468255+00:00 stderr F Error: Error performing operation: No such device or address
        2020-07-28T21:57:10.215468255+00:00 stderr F haproxy-bundle is not running anywhere and so cannot be restarted

[1] https://76bd6a632dcf869ef49b-9d73aaaa1727b5f44155748d9566e05a.ssl.cf2.rackcdn.com/739457/13/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/a6a63ab/logs/undercloud/home/zuul/standalone_upgrade.log
[2] https://review.opendev.org/#/c/739457
[3] https://bugs.launchpad.net/tripleo/+bug/1887159/comments/7
[4] https://76bd6a632dcf869ef49b-9d73aaaa1727b5f44155748d9566e05a.ssl.cf2.rackcdn.com/739457/13/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/a6a63ab/logs/undercloud/var/log/containers/stdouts/haproxy_restart_bundle.log

Revision history for this message
Marios Andreou (marios-b) wrote :

Adding some more pointers to logs from yesterday. I only had one example when I filed this bug, but it is definitely confirmed, see [1]:

        2020-07-29 13:56:44 | 2020-07-29 13:56:44.479766 | fa163e4d-0853-d9b1-ec78-000000002cbd | FATAL | Check containers status | standalone | error={"changed": false, "msg": "Failed container(s): ['haproxy_restart_bundle'], check logs in /var/log/containers/stdouts/"}

[1] https://df084da6644b38328cd6-e9f29c7afce5197c5c20e02f6b6da59e.ssl.cf2.rackcdn.com/739457/13/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/7ee96ea/logs/undercloud/home/zuul/standalone_upgrade.log

wes hayutin (weshayutin)
tags: added: alert
description: updated
Revision history for this message
Damien Ciabrini (dciabrin) wrote :

Thanks for filing it, I'm looking into it right now.

On a standalone deployment, the haproxy container is not started properly because of the way the haproxy config and the VIP are configured. Since it's a standalone deployment, this is not an immediate issue because the services aren't consumed directly via haproxy. We are looking into this specific concern with Michele.

What's happening in the failed job is that on upgrade, a technical container "haproxy_restart_bundle" tries to run "pcs resource restart haproxy" while the resource isn't started, and that shouldn't happen.

I'm having a look at the shell script which handles that restart (pacemaker_restart_bundle.sh). I'll add comments once I've figured out the exact problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/744529

Changed in tripleo:
assignee: Marios Andreou (marios-b) → Damien Ciabrini (dciabrin)
status: Triaged → In Progress
Revision history for this message
Damien Ciabrini (dciabrin) wrote :

So the container error can be fixed by calling the appropriate pcs command:

  . "resource restart" when there are running containers,
  . "resource cleanup" when all the containers errored out.

This will fix CI and will give us time to fix the HAProxy service on standalone separately.
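
For readers following along, the choice between the two pcs commands can be sketched as below. This is only an illustration of the decision described in this comment, not the code from pacemaker_restart_bundle.sh: the function name is hypothetical, the number of running replicas is passed in as an argument (how the real script detects it is not shown in this report), and the function only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not taken from pacemaker_restart_bundle.sh):
# print the pcs command suited to the current state of a resource.
#   $1 = pacemaker resource name
#   $2 = number of replicas currently running (assumed to be known)
pcs_recovery_command() {
    resource="$1"
    running="$2"
    if [ "$running" -gt 0 ]; then
        # At least one replica is up, so a restart has something to stop.
        echo "pcs resource restart $resource"
    else
        # Nothing is running, so there is nothing to stop: clean up the
        # failed state instead so pacemaker can retry the resource.
        echo "pcs resource cleanup $resource"
    fi
}
```

For example, `pcs_recovery_command haproxy-bundle 0` prints `pcs resource cleanup haproxy-bundle`, which corresponds to the failing case in this bug.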

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/744674

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/ussuri)

Change abandoned by Sergii Golovatiuk (<email address hidden>) on branch: stable/ussuri
Review: https://review.opendev.org/744674

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/744675

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/ussuri)

Change abandoned by Sergii Golovatiuk (<email address hidden>) on branch: stable/ussuri
Review: https://review.opendev.org/744675

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/744676

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/744529
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=ba471ee461b125e2aa53c485ab61dc467bf7d858
Submitter: Zuul
Branch: master

commit ba471ee461b125e2aa53c485ab61dc467bf7d858
Author: Damien Ciabrini <email address hidden>
Date: Mon Aug 3 18:59:44 2020 +0200

    Fix HA resource restart when no replicas are running

    When the helper script pacemaker_restart_bundle.sh is called
    during a stack update, it restarts the pacemaker resource via
    a "pcs resource restart <name>".

    When all the replicas are stopped due to a previous error,
    pcs won't restart them because there is nothing to stop. In
    that case, one must use "pcs resource cleanup <name>".

    Change-Id: I1790444d289d057e9a3f612c53efe485080978b5
    Closes-Bug: #1889395

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/744676
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=7f3eb2371d7a91245cbd7ffc14992a4a5ccd5950
Submitter: Zuul
Branch: stable/ussuri

commit 7f3eb2371d7a91245cbd7ffc14992a4a5ccd5950
Author: Damien Ciabrini <email address hidden>
Date: Mon Aug 3 18:59:44 2020 +0200

    Fix HA resource restart when no replicas are running

    When the helper script pacemaker_restart_bundle.sh is called
    during a stack update, it restarts the pacemaker resource via
    a "pcs resource restart <name>".

    When all the replicas are stopped due to a previous error,
    pcs won't restart them because there is nothing to stop. In
    that case, one must use "pcs resource cleanup <name>".

    (cherry picked from commit ba471ee461b125e2aa53c485ab61dc467bf7d858)

    Closes-Bug: #1889395
    Change-Id: I1790444d289d057e9a3f612c53efe485080978b5

tags: added: in-stable-ussuri
Revision history for this message
Damien Ciabrini (dciabrin) wrote :

There are still two issues to get this job passing.

1) Another patch needs to land for the upgrade to move forward: https://review.opendev.org/#/c/725782/.
With that one, the haproxy service correctly listens on the controller VIP created by pacemaker, and the upgrade can stop and restart that service without problem.

2) In master, all the users' UIDs/GIDs have changed; they no longer use kolla's UIDs/GIDs. The first service to break during a ussuri->master upgrade is rabbitmq, because on restart, rabbitmq will try to read a config file belonging to an old kolla UID and will get a permission denied.

After changing the ownership of the rabbitmq config file manually, the next upgrade failure happens in nova_api_db_sync, because this time the shell script can't write into the nova log file anymore...

So I believe we can't really fix that job until 2) is fixed.
I feel like having a massive pass to chown all the files is going to be a bit impractical, so I wonder if sticking to the old kolla UID/GID is not a simpler way forward?
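
To gauge how many files such a chown pass would have to touch, one could list everything still owned by a given numeric UID. A minimal sketch, assuming GNU or BSD find; the function name is hypothetical, and both the directory and the UID are placeholders to be supplied by the reader (no real kolla UID value is given in this report):

```shell
#!/bin/sh
# Hypothetical helper: list regular files under a directory that are
# owned by a given numeric UID.
#   $1 = directory to scan (placeholder, e.g. a service config directory)
#   $2 = numeric UID to look for (placeholder)
list_files_owned_by_uid() {
    dir="$1"
    uid="$2"
    find "$dir" -type f -uid "$uid" -print
}
```

This only reports files; it deliberately changes nothing, since whether to chown them or keep the old UID/GID is exactly the open question above.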

Revision history for this message
Marios Andreou (marios-b) wrote :

o/ dciabrin - I have had a couple of green runs in my Ussuri test at [1] - logs at [2][3] - the test includes the fixes for this bug that are already merged and the fix for the other bug we are tracking for ussuri standalone-upgrade at https://bugs.launchpad.net/tripleo/+bug/1887159 (i.e. fix is [4]).

WRT comment #12 above - do you still think we need more work here i.e. are the passing tests at [2][3] false positives?

thanks for checking

[1] https://review.opendev.org/#/c/739457/
[2] https://63bcf7c04a36b258ddfb-d8efb2c8dc9c4e35e02c540c6f97e75e.ssl.cf5.rackcdn.com/739457/14/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/38b9bd0/
[3] https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_318/739457/14/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/3182c17/
[4] https://review.opendev.org/#/c/742418/8/zuul.d/standalone-jobs.yaml@138

Revision history for this message
Damien Ciabrini (dciabrin) wrote :

I think I mixed up the upgrade jobs here. I was thinking about Sergii's upgrade jobs that exercise ussuri->master. Those are currently broken for the two reasons I mentioned in comment #12 (haproxy broken + UID/GID change in master).

The jobs that you're mentioning in comment #13 look to me like train->ussuri. Those are passing fine, but the haproxy service is still broken in deployment (train) and upgrade (ussuri), due to how we misconfigure the control plane VIP in HA standalone. That one would be fixed once https://review.opendev.org/#/c/725782/ lands.

So I think this launchpad originally tracked a problem that's been fixed (restarting haproxy when it's not running), so if you want we can close it and I'll change the Related-Bug in https://review.opendev.org/#/c/725782/, because it's effectively tracking another haproxy issue.

Revision history for this message
Emilien Macchi (emilienm) wrote :

Damien, we should address the GID/UID issue with https://review.opendev.org/#/c/745575/

Revision history for this message
Marios Andreou (marios-b) wrote :

thanks for looking damien - so to confirm: the job is green but really it shouldn't be - haproxy is broken and will be fixed with https://review.opendev.org/#/c/725782/.

I really don't mind either way about the bug - up to you. If you file a new one please add a note here so I can add the new related-bug into https://review.opendev.org/#/c/742418/ .

I mean they're both *haproxy* issues so ... really up to you.

Revision history for this message
Marios Andreou (marios-b) wrote :

added the patch at https://review.opendev.org/#/c/725782/ as depends-on in the ussuri test @ https://review.opendev.org/739457 so we can test it there FYI

Revision history for this message
Marios Andreou (marios-b) wrote :

following from comment #17 above we had a green run there:

https://51942374a8f4d0e6e94a-de2a4e9610e68e853bfff1f1436a242e.ssl.cf1.rackcdn.com/739457/15/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/957f05d/

would be great if someone can verify haproxy is OK now... the job was also previously green (see comment #13 above)

Revision history for this message
Damien Ciabrini (dciabrin) wrote :

Unfortunately, haproxy didn't start [1]. And looking at logs [2], it seems to me that the Depends-On didn't work.

Deployment line I see in the logs:

 "_deploy_cmd": "openstack tripleo deploy --templates $DEPLOY_TEMPLATES --standalone --yes --output-dir $DEPLOY_OUTPUT_DIR --stack $DEPLOY_STACK --standalone-role $DEPLOY_STANDALONE_ROLE --timeout $DEPLOY_TIMEOUT_ARG -e /usr/share/openstack-tripleo-heat-templates/environments/standalone/standalone-tripleo.yaml -e /home/zuul/containers-prepare-parameters.yaml -e /home/zuul/standalone_parameters.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/docker-ha.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/podman.yaml -r $DEPLOY_ROLES_FILE --deployment-user $DEPLOY_DEPLOYMENT_USER --local-ip $DEPLOY_LOCAL_IP >/home/zuul/standalone_deploy.log 2>&1"

It doesn't use the "--control-virtual-ip" override. With https://review.opendev.org/#/c/725782/, one should see:

TASK [standalone : Check whether control plane defaults to HA] **************************************************************
[...]
followed by, e.g.:

TASK [tripleo.operator.tripleo_deploy : Setup standalone deploy facts] ************************************************************************************************************************************************
task path: /tmp/bruce-ha/hab-03/run/share/ansible/collections/ansible_collections/tripleo/operator/roles/tripleo_deploy/tasks/main.yml:8
Monday 17 August 2020 06:57:07 -0400 (0:00:00.029) 0:02:19.347 *********
ok: [undercloud] => {
    "ansible_facts": {
        "_deploy_cmd": "openstack tripleo deploy --templates $DEPLOY_TEMPLATES --standalone --yes --output-dir $DEPLOY_OUTPUT_DIR --stack $DEPLOY_STACK --standalone-role $DEPLOY_STANDALONE_ROLE --timeout $DEPLOY_TIMEOUT_ARG -e /usr/share/openstack-tripleo-heat-templates/environments/standalone/standalone-tripleo.yaml -e /home/stack/containers-prepare-parameters.yaml -e /home/stack/standalone_parameters.yaml -r $DEPLOY_ROLES_FILE --deployment-user $DEPLOY_DEPLOYMENT_USER --local-ip $DEPLOY_LOCAL_IP --control-virtual-ip $DEPLOY_CONTROL_VIP >/home/stack/standalone_deploy.log 2>&1",

[1] https://51942374a8f4d0e6e94a-de2a4e9610e68e853bfff1f1436a242e.ssl.cf1.rackcdn.com/739457/15/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/957f05d/logs/undercloud/var/log/containers/stdouts/haproxy-bundle.log
[2] https://51942374a8f4d0e6e94a-de2a4e9610e68e853bfff1f1436a242e.ssl.cf1.rackcdn.com/739457/15/check/tripleo-ci-centos-8-standalone-upgrade-ussuri/957f05d/job-output.txt
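
The check described in this comment, whether a logged deploy command carries the "--control-virtual-ip" override, can be sketched as a small shell test. The function name is hypothetical and the command line is passed in as a plain string; extracting "_deploy_cmd" from the job output is left to the reader.

```shell
#!/bin/sh
# Hypothetical check: succeed (exit status 0) when a deploy command line
# passes the control plane VIP override, fail otherwise.
#   $1 = the full deploy command line as a single string
deploy_cmd_has_control_vip() {
    case "$1" in
        *"--control-virtual-ip "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Applied to the two "_deploy_cmd" values quoted above, the first (from the CI job) would fail the check and the second (with https://review.opendev.org/#/c/725782/ applied) would pass it.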

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-quickstart-extras (master)

Reviewed: https://review.opendev.org/725782
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart-extras/commit/?id=63c1365d94b2246a0be8580026aff001e6d8fce8
Submitter: Zuul
Branch: master

commit 63c1365d94b2246a0be8580026aff001e6d8fce8
Author: Michele Baldessari <email address hidden>
Date: Sat May 9 16:10:18 2020 +0200

    Pass proper VIPs on Standalone

    With HA being the default we should now specify the VIPs as well.
    This way pacemaker will create the separate VIP and haproxy will listen
    to it and we won't have both haproxy and the service itself listening to
    the same port on the same IP.

    Note: that we tried to make this work without creating a VIP at all and
    having haproxy's backend services listen to localhost only and leave
    haproxy to listen on the standalone IP, but this quickly became
    impossible to make it work due to how things are coded in a number
    of places (bootstrap hostname checks, pacemaker properties,etc.)

    Related-Bug: #1889395

    Change-Id: I367cf4b65300be8dca0190b9adeab549018d4a56

Revision history for this message
Marios Andreou (marios-b) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/746957

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/746957
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=fffbdc0df115ed7f442620e4fff22fcacd595e70
Submitter: Zuul
Branch: stable/train

commit fffbdc0df115ed7f442620e4fff22fcacd595e70
Author: Damien Ciabrini <email address hidden>
Date: Mon Aug 3 18:59:44 2020 +0200

    Fix HA resource restart when no replicas are running

    When the helper script pacemaker_restart_bundle.sh is called
    during a stack update, it restarts the pacemaker resource via
    a "pcs resource restart <name>".

    When all the replicas are stopped due to a previous error,
    pcs won't restart them because there is nothing to stop. In
    that case, one must use "pcs resource cleanup <name>".

    (cherry picked from commit ba471ee461b125e2aa53c485ab61dc467bf7d858)

    Closes-Bug: #1889395
    Change-Id: I1790444d289d057e9a3f612c53efe485080978b5

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-operator-ansible (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/750995

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-operator-ansible (master)

Reviewed: https://review.opendev.org/750995
Committed: https://git.openstack.org/cgit/openstack/tripleo-operator-ansible/commit/?id=5fa02569c2047f2ffae1bdf69914d5dba1ad0443
Submitter: Zuul
Branch: master

commit 5fa02569c2047f2ffae1bdf69914d5dba1ad0443
Author: Alex Schultz <email address hidden>
Date: Thu Sep 10 07:54:40 2020 -0600

    Add control vip to standalone playbook

    Since we've switched to pacemaker by default, we need to specify the
    control vip.

    Change-Id: I2ef813508fbb7cb54ca76a95c5c3997fce6a8b9d
    Related-Bug: #1889395

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.
