Fix up a race when deploying pacemaker_remote nodes

Bug #1689028 reported by Michele Baldessari
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Michele Baldessari
Milestone: (none)

Bug Description

We currently create remote resources without waiting for their creation
to complete. This leads to the following potential race (spotted by
Marian Mkrcmari):
- On Step1 the pacemaker bootstrap node requests creation of the remote
  resource, but the resource is not yet created when the step finishes
- Step1 completes and Step2 starts
- On Step2 the remote node sets a property (or calls pcs cib), but the
  remote is not yet set up, so 'pcs cluster cib' fails there with:

    (err): Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster
    cib /var/lib/pacemaker/cib/puppet-cib-backup20170506-15994-1swnk1i failed
    with code: 1 ->

I am not entirely sure why this race only started showing up now. My suspicion
is that it has the same cause as https://launchpad.net/bugs/1688322, which also
surfaced only recently: most likely a change in puppet dependencies altered the
order of execution and broke some assumptions.
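
The race described above can be reproduced in miniature. The following is only a toy model in Python, not TripleO or Pacemaker code: `FakeRemote`, `step1_create_remote`, `step2_backup_cib` and `deploy` are hypothetical names, and the "remote becomes ready after three ticks" behaviour is an assumption made purely for illustration.

```python
class FakeRemote:
    """Toy stand-in for a pacemaker_remote node (hypothetical, not a real API)."""

    def __init__(self, ticks_until_ready: int) -> None:
        self.ticks_left = ticks_until_ready

    def tick(self) -> None:
        """Let one unit of simulated time pass."""
        if self.ticks_left > 0:
            self.ticks_left -= 1

    @property
    def ready(self) -> bool:
        return self.ticks_left == 0

    def dump_cib(self) -> str:
        """Mimic 'pcs cluster cib': it fails until the remote is set up."""
        if not self.ready:
            raise RuntimeError("pcs cluster cib failed with code: 1")
        return "<cib/>"


def step1_create_remote(remote: FakeRemote, wait: bool) -> None:
    """Step1: request the remote resource, optionally waiting until it is up."""
    if wait:
        while not remote.ready:
            remote.tick()


def step2_backup_cib(remote: FakeRemote) -> str:
    """Step2: the remote node dumps the CIB before setting a property."""
    return remote.dump_cib()


def deploy(wait: bool) -> str:
    """Run Step1 then Step2 and report whether Step2 succeeded."""
    remote = FakeRemote(ticks_until_ready=3)  # assumed startup delay
    step1_create_remote(remote, wait)
    try:
        step2_backup_cib(remote)
    except RuntimeError as exc:
        return f"failed: {exc}"
    return "ok"


if __name__ == "__main__":
    print(deploy(wait=False))  # Step2 runs too early and fails
    print(deploy(wait=True))   # Step1 waited, so Step2 succeeds
```

In this model the only difference between the failing and the passing run is whether Step1 blocks until the remote is usable.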

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/463103
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=b6d02fd5001153b53b3061d63d2cb686b0646f18
Submitter: Jenkins
Branch: master

commit b6d02fd5001153b53b3061d63d2cb686b0646f18
Author: Michele Baldessari <email address hidden>
Date: Sat May 6 17:40:24 2017 +0200

    Use verify_on_create when creating pacemaker remote resources

    We currently create remote resources without waiting for their creation.
    This leads to the following potential race (spotted by Marian Mkrcmari):
    - On Step1 pacemaker bootstrap node creates the resource but the remote
      resource is not yet created
    - Step1 completes and Step2 starts
    - On Step2 the remote node sets a property (or calls pcs cib) but the
      remote is not yet set up so 'pcs cluster cib' will fail there with:

        (err): Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster
        cib /var/lib/pacemaker/cib/puppet-cib-backup20170506-15994-1swnk1i failed
        with code: 1 ->

    Note that when verify_on_create is set to true we are not using the cib
    dump/push mechanism. That is fine because we create the remotes on
    step1 and the dump/push mechanism is only needed starting from step2
    when multiple nodes set cluster properties at the same time.

    Tested by Marian Mkrcmari successfully as well.

    Closes-Bug: #1689028

    Change-Id: I764526b3f3c06591d477cc92779d83a19802368e
    Depends-On: I1db31dcc92b8695ab0522bba91df729b37f34e0f
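
For readers unfamiliar with the idea behind verify_on_create, the general pattern is "issue the create, then poll until the resource is confirmed or a retry budget runs out". The sketch below shows that pattern in Python under stated assumptions: `create_and_verify` and its `tries`, `try_sleep` and `sleep` parameters are hypothetical names chosen for this example, and it does not reproduce the actual puppet-pacemaker implementation, which is not shown in this report.

```python
import time
from typing import Callable


def create_and_verify(
    create: Callable[[], None],
    is_up: Callable[[], bool],
    tries: int = 5,
    try_sleep: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Issue a create request, then poll until the resource is confirmed.

    Returns True as soon as is_up() reports success, or False once the
    retry budget is exhausted. The sleep function is injectable so the
    behaviour can be exercised without real delays.
    """
    create()
    for attempt in range(tries):
        if is_up():
            return True
        # Do not sleep after the final failed check.
        if attempt < tries - 1:
            sleep(try_sleep)
    return False


if __name__ == "__main__":
    state = {"checks": 0}

    def fake_create() -> None:
        pass  # a real caller would run the resource creation command here

    def fake_is_up() -> bool:
        # Assumed for illustration: the resource appears on the third check.
        state["checks"] += 1
        return state["checks"] >= 3

    print(create_and_verify(fake_create, fake_is_up, try_sleep=0.0))
```

With a check like this in place, the step that creates the remote only finishes once the remote has been seen, or reports failure explicitly rather than letting a later step hit the error.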

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/466194

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/ocata)

Reviewed: https://review.openstack.org/466194
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=72392a373ca3c0f2b4ea6887fde19cf886288dc4
Submitter: Jenkins
Branch: stable/ocata

commit 72392a373ca3c0f2b4ea6887fde19cf886288dc4
Author: Michele Baldessari <email address hidden>
Date: Sat May 6 17:40:24 2017 +0200

    Use verify_on_create when creating pacemaker remote resources

    (cherry picked from commit b6d02fd5001153b53b3061d63d2cb686b0646f18)

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 7.1.0

This issue was fixed in the openstack/puppet-tripleo 7.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 6.5.0

This issue was fixed in the openstack/puppet-tripleo 6.5.0 release.
