pcs on host patchset triggers a problem during FFU on IHA deployments

Bug #1923723 reported by Michele Baldessari
Affects  Status        Importance  Assigned to  Milestone
tripleo  Fix Released  High        Unassigned

Bug Description

Since the pcs on host patchset merged for 16.2, we are seeing FFUs fail on Instance HA (IHA) environments.

Preamble: TripleO keeps the stonith-enabled cluster property set to false until step 5.

With the pcs on host patchset, the enablement still happens at step 5, but it is now triggered during the tripleo_ha_wrapper deployment task of cinder-volume. That task tries to restart the cinder-volume service (during the leapp of the first controller) and hangs forever because pacemaker is in the following transition:
- stonith-fence_compute-fence-nova is configured
- pacemaker wants to call stonith "on" for controller-0 (which is probably dumb, but we are unlikely to be able to change that in the right timeframe, as it looks like a potentially involved change in behaviour)
- Any other action, like the cinder-volume restart in this case, is stuck and the FFU fails.

If we simply move the stonith resource creation to step 2, and leave the stonith-enabled property being set at step 5 unchanged, this should fix the problem.
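
The proposed reordering can be pictured with a small toy model. This is only a sketch and not puppet-tripleo code: the function names and action strings are hypothetical, and it assumes the stonith resources were previously created at step 5 together with the property change, which the description implies but does not state outright.

```python
# Toy model only: step numbers come from the bug description; the function
# and action names are hypothetical and are not puppet-tripleo code.
STONITH_PROPERTY_STEP = 5  # stonith-enabled stays false until step 5


def stonith_actions(step, resource_step):
    """Return the stonith-related actions performed at one deployment step."""
    actions = []
    if step == resource_step:
        actions.append("create stonith resources")
    if step == STONITH_PROPERTY_STEP:
        actions.append("set stonith-enabled=true")
    return actions


def timeline(resource_step, last_step=5):
    """Map each step (1..last_step) to the stonith actions it runs."""
    return {s: stonith_actions(s, resource_step) for s in range(1, last_step + 1)}


if __name__ == "__main__":
    # "before" uses the assumed original placement (step 5); "after" uses step 2.
    for label, resource_step in (("before fix", 5), ("after fix", 2)):
        print(label)
        for step, actions in timeline(resource_step).items():
            print(f"  step {step}: {', '.join(actions) or '-'}")
```

In this model the only difference between the two timelines is where "create stonith resources" lands; the property change stays at step 5 in both, matching the "change nothing else" part of the proposal.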

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/puppet-tripleo/+/786114

OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/puppet-tripleo/+/786115

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/c/openstack/puppet-tripleo/+/785862
Committed: https://opendev.org/openstack/puppet-tripleo/commit/59076017bb7d0cd5f67d9f8b38711255a32a3868
Submitter: "Zuul (22348)"
Branch: master

commit 59076017bb7d0cd5f67d9f8b38711255a32a3868
Author: Michele Baldessari <email address hidden>
Date: Mon Apr 12 14:22:58 2021 +0200

    Move stonith resource creation to step2

    With the merging of the pcs on host patchset for train we are seeing a
    problem with FFUs on Instance HA environments.

    Preamble:
    Tripleo keeps the stonith-enabled cluster property set to false until the puppet step 5

    With the pcs on host patchset the enablement happens still at step 5 but
    it gets triggered during tripleo_ha_wrapper deployment task of
    cinder-volume which tries to restart the cinder-volume service (during
    the leapp of the first controller) and this hangs forever because
    pacemaker is in the following transition:
    - stonith-fence_compute-fence-nova is configured
    - pacemaker wants to call stonith on for controller-0 (which is probably
      dumb, but it is unlikely we'll be able to change that in the right
      timeframe as it seems a potentially involved change in behaviour)
    - Any other action, like cinder-volume restart in this case, is stuck
      and the FFU fails.

    If we simply move the stonith resource creation (and change nothing else
    in the stonith-enabled property being set at step 5) to step 2, we
    fix this.

    Tested and with the injection of this puppet-tripleo review into the
    FFU queens->train upgrade on an IHA system, now the FFU passes.
    Also applied this patch to a Train based IHA deployment and verified
    that deployment, redeploy, minor update and scaleup all keep on working.

    Closes-Bug: #1923723

    Change-Id: Ib3e2d9c93221dfc2e15974142f30e8c84e7afd63

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/puppet-tripleo/+/786114
Committed: https://opendev.org/openstack/puppet-tripleo/commit/b71bcbf982e3b5585e4abe8770977bf6e999624b
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit b71bcbf982e3b5585e4abe8770977bf6e999624b
Author: Michele Baldessari <email address hidden>
Date: Mon Apr 12 14:22:58 2021 +0200

    Move stonith resource creation to step2

    Closes-Bug: #1923723

    Change-Id: Ib3e2d9c93221dfc2e15974142f30e8c84e7afd63

tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/train)

Reviewed: https://review.opendev.org/c/openstack/puppet-tripleo/+/785863
Committed: https://opendev.org/openstack/puppet-tripleo/commit/bd1807c48bf1d020eac39699395839b918ea49a5
Submitter: "Zuul (22348)"
Branch: stable/train

commit bd1807c48bf1d020eac39699395839b918ea49a5
Author: Michele Baldessari <email address hidden>
Date: Mon Apr 12 14:22:58 2021 +0200

    Move stonith resource creation to step2

    Closes-Bug: #1923723

    Change-Id: Ib3e2d9c93221dfc2e15974142f30e8c84e7afd63
    (cherry picked from commit 6196157b54efb2c0bdd1c0803f4fcd10e9a18d84)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/puppet-tripleo/+/786115
Committed: https://opendev.org/openstack/puppet-tripleo/commit/d9857e6f97157b3b4e0080d08fd407f92484d66f
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit d9857e6f97157b3b4e0080d08fd407f92484d66f
Author: Michele Baldessari <email address hidden>
Date: Mon Apr 12 14:22:58 2021 +0200

    Move stonith resource creation to step2

    Closes-Bug: #1923723

    Change-Id: Ib3e2d9c93221dfc2e15974142f30e8c84e7afd63

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 14.1.0

This issue was fixed in the openstack/puppet-tripleo 14.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 13.6.2

This issue was fixed in the openstack/puppet-tripleo 13.6.2 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 11.7.0

This issue was fixed in the openstack/puppet-tripleo 11.7.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 12.6.1

This issue was fixed in the openstack/puppet-tripleo 12.6.1 release.
