Instance HA could be more robust

Bug #1831234 reported by Michele Baldessari
This bug affects 1 person
Affects | Status       | Importance | Assigned to        | Milestone
tripleo | Fix Released | High       | Michele Baldessari |

Bug Description

The 'compute-unfence-trigger' resource is simply a dummy resource that is only used to trigger unfence events. If for whatever reason this resource is having issues (pcmk bug, node overloaded, etc.) it makes zero sense to have pacemaker do a fencing action against the node. Fencing here brings us very little and is quite invasive.

We can be smarter about it and make it so that a failure on the unfence-trigger is non-disruptive.
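
For context, here is a rough sketch of why a failed stop escalates to fencing in the first place. As far as I understand Pacemaker's documented defaults, an operation's on-fail is "restart", except for stop, which defaults to "fence" when fencing is enabled and "block" otherwise. The function name and signature below are made up for illustration and are not a Pacemaker API.

```python
from typing import Optional


def effective_on_fail(action: str, stonith_enabled: bool,
                      configured: Optional[str] = None) -> str:
    """Return the failure policy that applies to a resource operation.

    Hypothetical helper: it only models the documented defaults and is
    not how Pacemaker is implemented.
    """
    # An explicit on-fail on the operation always wins over the default.
    if configured is not None:
        return configured
    # A failed stop means the cluster cannot prove the resource is down,
    # so the default is to fence the node when fencing is available.
    if action == "stop":
        return "fence" if stonith_enabled else "block"
    # Every other operation defaults to restarting the resource.
    return "restart"


if __name__ == "__main__":
    print(effective_on_fail("stop", stonith_enabled=True))
    print(effective_on_fail("stop", stonith_enabled=True, configured="block"))
```

With fencing enabled, the first call yields "fence" and the second yields "block", which is the less invasive behaviour asked for here.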

Changed in tripleo:
status: Triaged → In Progress
Michele Baldessari (michele) wrote :
Changed in tripleo:
milestone: train-1 → train-2
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/661702
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=8d2c3a0e6e392e2f358bf29a3b900afcd5bbe56a
Submitter: Zuul
Branch: master

commit 8d2c3a0e6e392e2f358bf29a3b900afcd5bbe56a
Author: Michele Baldessari <email address hidden>
Date: Tue May 28 10:50:16 2019 +0200

    IHA robustness improvements

    This will avoid useless fencing events in case of stonith problems. The
    'compute-unfence-trigger' resource is simply a dummy resource that is
    only used to trigger unfence events. If for whatever reason this
    resource is having issues on stop (pcmk bug, node overloaded, etc.) it
    makes zero sense to have pacemaker do a fencing action against the node.
    Let's just block and show the operator the status and be less harsh in
    general.

    Tested this and I correctly get the following:
    [root@controller-0 ~]# pcs resource show compute-unfence-trigger-clone
     Clone: compute-unfence-trigger-clone
      Resource: compute-unfence-trigger (class=ocf provider=pacemaker type=Dummy)
       Meta Attrs: requires=unfencing
       Operations: migrate_from interval=0s timeout=20 (compute-unfence-trigger-migrate_from-interval-0s)
                   migrate_to interval=0s timeout=20 (compute-unfence-trigger-migrate_to-interval-0s)
                   monitor interval=10 timeout=20 (compute-unfence-trigger-monitor-interval-10)
                   reload interval=0s timeout=20 (compute-unfence-trigger-reload-interval-0s)
                   start interval=0s timeout=20 (compute-unfence-trigger-start-interval-0s)
                   stop interval=0s on-fail=block timeout=20 (compute-unfence-trigger-stop-interval-0s)

    Closes-Bug: #1831234

    Change-Id: Ib4884078d54c25da998495ba3e8d47b5e17010ce
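
A deployment could be checked for this setting with something like the sketch below, which parses text in the shape of the `pcs resource show` output quoted in the commit message and reports whether the stop operation carries on-fail=block. The sample is an abridged copy of that output. The function names are hypothetical, and the parser assumes the exact layout shown above, which may differ between pcs versions.

```python
import re

# Abridged copy of the output quoted in the commit message.
SAMPLE = """\
 Clone: compute-unfence-trigger-clone
  Resource: compute-unfence-trigger (class=ocf provider=pacemaker type=Dummy)
   Meta Attrs: requires=unfencing
   Operations: migrate_from interval=0s timeout=20 (compute-unfence-trigger-migrate_from-interval-0s)
               monitor interval=10 timeout=20 (compute-unfence-trigger-monitor-interval-10)
               stop interval=0s on-fail=block timeout=20 (compute-unfence-trigger-stop-interval-0s)
"""

# One operation per line: "<name> key=value ... (<operation id>)",
# optionally prefixed by the "Operations:" label on the first line.
OP_LINE = re.compile(
    r"^\s*(?:Operations:\s*)?(\w+)((?:\s+[\w-]+=\S+)+)\s+\(\S+\)\s*$"
)


def parse_operations(text):
    """Map each operation name to a dict of its key=value attributes."""
    ops = {}
    for line in text.splitlines():
        match = OP_LINE.match(line)
        if match:
            name, attrs = match.groups()
            ops[name] = dict(kv.split("=", 1) for kv in attrs.split())
    return ops


def stop_blocks_on_failure(text):
    """True if the stop operation is configured with on-fail=block."""
    return parse_operations(text).get("stop", {}).get("on-fail") == "block"


if __name__ == "__main__":
    print(stop_blocks_on_failure(SAMPLE))
```

In practice the text would come from running the pcs command on a controller rather than from an embedded sample.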

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/665589

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/stein)

Reviewed: https://review.opendev.org/665589
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=a2ceac8bafedcedae20811a78a6ccb3823abf36d
Submitter: Zuul
Branch: stable/stein

commit a2ceac8bafedcedae20811a78a6ccb3823abf36d
Author: Michele Baldessari <email address hidden>
Date: Tue May 28 10:50:16 2019 +0200

    IHA robustness improvements

    This will avoid useless fencing events in case of stonith problems. The
    'compute-unfence-trigger' resource is simply a dummy resource that is
    only used to trigger unfence events. If for whatever reason this
    resource is having issues on stop (pcmk bug, node overloaded, etc.) it
    makes zero sense to have pacemaker do a fencing action against the node.
    Let's just block and show the operator the status and be less harsh in
    general.

    Tested this and I correctly get the following:
    [root@controller-0 ~]# pcs resource show compute-unfence-trigger-clone
     Clone: compute-unfence-trigger-clone
      Resource: compute-unfence-trigger (class=ocf provider=pacemaker type=Dummy)
       Meta Attrs: requires=unfencing
       Operations: migrate_from interval=0s timeout=20 (compute-unfence-trigger-migrate_from-interval-0s)
                   migrate_to interval=0s timeout=20 (compute-unfence-trigger-migrate_to-interval-0s)
                   monitor interval=10 timeout=20 (compute-unfence-trigger-monitor-interval-10)
                   reload interval=0s timeout=20 (compute-unfence-trigger-reload-interval-0s)
                   start interval=0s timeout=20 (compute-unfence-trigger-start-interval-0s)
                   stop interval=0s on-fail=block timeout=20 (compute-unfence-trigger-stop-interval-0s)

    Closes-Bug: #1831234

    Change-Id: Ib4884078d54c25da998495ba3e8d47b5e17010ce
    (cherry picked from commit 8d2c3a0e6e392e2f358bf29a3b900afcd5bbe56a)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/666405

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/rocky)

Reviewed: https://review.opendev.org/666405
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=275d219b0d9183f4a027dbf418ffaf368dd652c8
Submitter: Zuul
Branch: stable/rocky

commit 275d219b0d9183f4a027dbf418ffaf368dd652c8
Author: Michele Baldessari <email address hidden>
Date: Tue May 28 10:50:16 2019 +0200

    IHA robustness improvements

    This will avoid useless fencing events in case of stonith problems. The
    'compute-unfence-trigger' resource is simply a dummy resource that is
    only used to trigger unfence events. If for whatever reason this
    resource is having issues on stop (pcmk bug, node overloaded, etc.) it
    makes zero sense to have pacemaker do a fencing action against the node.
    Let's just block and show the operator the status and be less harsh in
    general.

    Tested this and I correctly get the following:
    [root@controller-0 ~]# pcs resource show compute-unfence-trigger-clone
     Clone: compute-unfence-trigger-clone
      Resource: compute-unfence-trigger (class=ocf provider=pacemaker type=Dummy)
       Meta Attrs: requires=unfencing
       Operations: migrate_from interval=0s timeout=20 (compute-unfence-trigger-migrate_from-interval-0s)
                   migrate_to interval=0s timeout=20 (compute-unfence-trigger-migrate_to-interval-0s)
                   monitor interval=10 timeout=20 (compute-unfence-trigger-monitor-interval-10)
                   reload interval=0s timeout=20 (compute-unfence-trigger-reload-interval-0s)
                   start interval=0s timeout=20 (compute-unfence-trigger-start-interval-0s)
                   stop interval=0s on-fail=block timeout=20 (compute-unfence-trigger-stop-interval-0s)

    Closes-Bug: #1831234

    Change-Id: Ib4884078d54c25da998495ba3e8d47b5e17010ce
    (cherry picked from commit 8d2c3a0e6e392e2f358bf29a3b900afcd5bbe56a)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/668254

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.opendev.org/668254
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=d64f7bfbb422da62306be9d7fa60d03db29754e1
Submitter: Zuul
Branch: stable/queens

commit d64f7bfbb422da62306be9d7fa60d03db29754e1
Author: Michele Baldessari <email address hidden>
Date: Tue May 28 10:50:16 2019 +0200

    IHA robustness improvements

    This will avoid useless fencing events in case of stonith problems. The
    'compute-unfence-trigger' resource is simply a dummy resource that is
    only used to trigger unfence events. If for whatever reason this
    resource is having issues on stop (pcmk bug, node overloaded, etc.) it
    makes zero sense to have pacemaker do a fencing action against the node.
    Let's just block and show the operator the status and be less harsh in
    general.

    Tested this and I correctly get the following:
    [root@controller-0 ~]# pcs resource show compute-unfence-trigger-clone
     Clone: compute-unfence-trigger-clone
      Resource: compute-unfence-trigger (class=ocf provider=pacemaker type=Dummy)
       Meta Attrs: requires=unfencing
       Operations: migrate_from interval=0s timeout=20 (compute-unfence-trigger-migrate_from-interval-0s)
                   migrate_to interval=0s timeout=20 (compute-unfence-trigger-migrate_to-interval-0s)
                   monitor interval=10 timeout=20 (compute-unfence-trigger-monitor-interval-10)
                   reload interval=0s timeout=20 (compute-unfence-trigger-reload-interval-0s)
                   start interval=0s timeout=20 (compute-unfence-trigger-start-interval-0s)
                   stop interval=0s on-fail=block timeout=20 (compute-unfence-trigger-stop-interval-0s)

    Closes-Bug: #1831234

    Change-Id: Ib4884078d54c25da998495ba3e8d47b5e17010ce
    (cherry picked from commit 8d2c3a0e6e392e2f358bf29a3b900afcd5bbe56a)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 10.5.0

This issue was fixed in the openstack/puppet-tripleo 10.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 11.1.0

This issue was fixed in the openstack/puppet-tripleo 11.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.5.1

This issue was fixed in the openstack/puppet-tripleo 9.5.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.5.1

This issue was fixed in the openstack/puppet-tripleo 8.5.1 release.
