remote race

Bug #1807906 reported by Michele Baldessari
This bug affects 1 person
Affects            Status        Importance  Assigned to          Milestone
puppet-pacemaker   Fix Released  Undecided   Michele Baldessari
tripleo            Fix Released  High        Michele Baldessari

Bug Description

There is still a pcmk remote race inside the setup code.
So when we landed https://review.openstack.org/#/c/569565/ we created
the remote authkey file with the following constraints:
  Exec["Create Cluster ${cluster_name}"] -> File['etc-pacemaker-authkey']
  File['etc-pacemaker-authkey'] -> Exec["Start Cluster ${cluster_name}"]

This was because pcs, at the time, would remove the authkey when calling
cluster setup. pcs has since been fixed to no longer remove this key, so
we actually want the key to be created as one of the very first steps,
i.e. even before pcsd starts.

Changed in puppet-pacemaker:
assignee: nobody → Michele Baldessari (michele)
Changed in tripleo:
importance: Undecided → High
assignee: nobody → Michele Baldessari (michele)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-pacemaker (master)

Fix proposed to branch: master
Review: https://review.openstack.org/624349

Changed in puppet-pacemaker:
status: New → In Progress
tags: added: queens-backport-potential rocky-backport-potential
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/624352

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-pacemaker (master)

Reviewed: https://review.openstack.org/624349
Committed: https://git.openstack.org/cgit/openstack/puppet-pacemaker/commit/?id=7b55ac38ecd2b7bfcdf43578511df54b20d775da
Submitter: Zuul
Branch: master

commit 7b55ac38ecd2b7bfcdf43578511df54b20d775da
Author: Michele Baldessari <email address hidden>
Date: Tue Dec 11 11:25:18 2018 +0100

    Fix up ordering of remote authkeys and a couple of pcs commands

    So when we landed https://review.openstack.org/#/c/569565/ we created
    the remote authkey file with the following constraints:
      Exec["Create Cluster ${cluster_name}"] -> File['etc-pacemaker-authkey']
      File['etc-pacemaker-authkey'] -> Exec["Start Cluster ${cluster_name}"]

    This was because pcs, at the time, would remove the authkey when calling
    cluster setup. pcs has since been fixed to no longer remove this key, so
    we actually want the key to be created as one of the very first steps,
    i.e. even before pcsd starts.

    That way we have the guarantee that pcs is aware of it and will not
    remove it when destroying the cluster [1].

    This will remove the error messages that were seen on the remotes for a
    certain amount of time (until pacemaker decided to reread the authkey
    from disk and retry the connection with the new credentials):
    pacemaker_remoted[21460]: notice: LRMD client connection established. 0x55d7f48bdad0 id: e662d8b9-c353-4e0e-9818-158812fedd34
    pacemaker_remoted[21460]: error: TLS handshake with Pacemaker Remote failed: Decryption has failed.

    While we're at it, we need to make every pcs auth command explicitly
    require Service['pcsd']. Right now this works by pure accident: those
    commands fail if Puppet decides to order them before pcsd is up and
    running.

    Closes-Bug: #1807906

    [1] rhbz#1459503

    Change-Id: I7164787205d2994e5949c29f756658d6392d7a4c
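The ordering that the commit above enforces through Puppet relationships (authkey before pcsd, every pcs auth after pcsd) can be approximated as an imperative bring-up sequence. This is a minimal sketch, not how puppet-pacemaker implements the fix. It assumes pcs 0.9-era syntax (`pcs cluster auth`, `pcs cluster setup --name`), and `CLUSTER_NAME`, `NODES`, `HACLUSTER_PASSWORD`, `run` and `bring_up` are hypothetical names. By default (`DRY_RUN=1`) it only prints the commands in order.

```shell
#!/bin/sh
set -eu

# Print commands instead of executing them unless DRY_RUN=0 (hypothetical switch).
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical defaults; override via the environment.
CLUSTER_NAME=${CLUSTER_NAME:-tripleo_cluster}
NODES=${NODES:-"controller-0 controller-1 controller-2"}
HACLUSTER_PASSWORD=${HACLUSTER_PASSWORD:-changeme}

bring_up() {
  # 1. Create the Pacemaker Remote authkey first, before pcsd is started.
  run mkdir -p /etc/pacemaker
  run dd if=/dev/urandom of=/etc/pacemaker/authkey bs=4096 count=1
  run chown hacluster:haclient /etc/pacemaker/authkey
  run chmod 0640 /etc/pacemaker/authkey

  # 2. Only then start pcsd.
  run systemctl start pcsd

  # 3. pcs cluster auth talks to pcsd, so it must come after step 2.
  #    NODES is word-split on purpose: one argument per node.
  run pcs cluster auth $NODES -u hacluster -p "$HACLUSTER_PASSWORD"

  # 4. Create and start the cluster.
  run pcs cluster setup --name "$CLUSTER_NAME" $NODES
  run pcs cluster start --all
}
```

Source the file and call `bring_up` to see the order; set `DRY_RUN=0` only as root on a disposable test node.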

Changed in puppet-pacemaker:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on puppet-tripleo (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.openstack.org/624352
Reason: failing in gate due to bug 1808883

OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/624352
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=ce6df5888ddadcb6cfeed1db588b03cbf3fc103c
Submitter: Zuul
Branch: master

commit ce6df5888ddadcb6cfeed1db588b03cbf3fc103c
Author: Michele Baldessari <email address hidden>
Date: Tue Dec 11 11:56:25 2018 +0100

    Make sure we do not match multiple remotes when waiting for them

    The current grep could match node-11 even when we're waiting for node-1.
    Match whitespace (\s) right after the node name so we're sure that we
    match only the right node:
    [root@controller-0 ~]# pcs status |grep -e "overcloud-novacomputeiha-1\s.*Started"
     overcloud-novacomputeiha-1 (ocf::pacemaker:remote): Started controller-1
    [root@controller-0 ~]# pcs status |grep -e "overcloud-novacomputeiha-11\s.*Started"
     overcloud-novacomputeiha-11 (ocf::pacemaker:remote): Started controller-0

    Change-Id: Ieedf6020f78015a9fe19f8ae61e0e6e4a3b4cb5b
    Related-Bug: #1807906
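To see why the trailing whitespace matters, the sketch below replays the two sample `pcs status` lines from the commit message through both patterns. It assumes the earlier pattern was the node name followed directly by `.*Started` (the commit does not show the old line) and relies on GNU grep, where `\s` matches whitespace (`[[:space:]]` is the POSIX equivalent). `count_started` is a hypothetical helper.

```shell
#!/bin/sh
set -eu

# Sample lines taken from the pcs status output quoted in the commit message.
SAMPLE='overcloud-novacomputeiha-1 (ocf::pacemaker:remote): Started controller-1
overcloud-novacomputeiha-11 (ocf::pacemaker:remote): Started controller-0'

# Print how many sample lines report NODE as Started.
# With "old" as the second argument, the pattern has nothing after the node
# name (assumed pre-fix form); otherwise \s requires whitespace right after it.
count_started() {
  node=$1
  if [ "${2:-new}" = old ]; then
    pattern="${node}.*Started"
  else
    pattern="${node}\s.*Started"
  fi
  # grep -c prints 0 and exits 1 when nothing matches; keep going anyway.
  printf '%s\n' "$SAMPLE" | grep -c -e "$pattern" || true
}
```

`count_started overcloud-novacomputeiha-1 old` prints 2, because the unterminated name also matches the `-11` line; `count_started overcloud-novacomputeiha-1` prints 1.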

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/626151

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to puppet-tripleo (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/626152

OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (stable/rocky)

Reviewed: https://review.openstack.org/626151
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=33063cdd44671cc87fb2803f1bfddf5b6048265b
Submitter: Zuul
Branch: stable/rocky

commit 33063cdd44671cc87fb2803f1bfddf5b6048265b
Author: Michele Baldessari <email address hidden>
Date: Tue Dec 11 11:56:25 2018 +0100

    Make sure we do not match multiple remotes when waiting for them

    The current grep could match node-11 even when we're waiting for node-1.
    Match whitespace (\s) right after the node name so we're sure that we
    match only the right node:
    [root@controller-0 ~]# pcs status |grep -e "overcloud-novacomputeiha-1\s.*Started"
     overcloud-novacomputeiha-1 (ocf::pacemaker:remote): Started controller-1
    [root@controller-0 ~]# pcs status |grep -e "overcloud-novacomputeiha-11\s.*Started"
     overcloud-novacomputeiha-11 (ocf::pacemaker:remote): Started controller-0

    Change-Id: Ieedf6020f78015a9fe19f8ae61e0e6e4a3b4cb5b
    Related-Bug: #1807906
    (cherry picked from commit ce6df5888ddadcb6cfeed1db588b03cbf3fc103c)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Related fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.openstack.org/626152
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=882c78c0aee81267efa3eb2ea92c8012a5b8dff4
Submitter: Zuul
Branch: stable/queens

commit 882c78c0aee81267efa3eb2ea92c8012a5b8dff4
Author: Michele Baldessari <email address hidden>
Date: Tue Dec 11 11:56:25 2018 +0100

    Make sure we do not match multiple remotes when waiting for them

    The current grep could match node-11 even when we're waiting for node-1.
    Match whitespace (\s) right after the node name so we're sure that we
    match only the right node:
    [root@controller-0 ~]# pcs status |grep -e "overcloud-novacomputeiha-1\s.*Started"
     overcloud-novacomputeiha-1 (ocf::pacemaker:remote): Started controller-1
    [root@controller-0 ~]# pcs status |grep -e "overcloud-novacomputeiha-11\s.*Started"
     overcloud-novacomputeiha-11 (ocf::pacemaker:remote): Started controller-0

    Change-Id: Ieedf6020f78015a9fe19f8ae61e0e6e4a3b4cb5b
    Related-Bug: #1807906
    (cherry picked from commit ce6df5888ddadcb6cfeed1db588b03cbf3fc103c)

tags: added: in-stable-queens
Changed in tripleo:
status: New → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-pacemaker 0.7.2

This issue was fixed in the openstack/puppet-pacemaker 0.7.2 release.
