pcmk remote setup code has a race

Bug #1773754 reported by Michele Baldessari
This bug affects 1 person
Affects           Status        Importance  Assigned to         Milestone
puppet-pacemaker  Fix Released  Undecided   Michele Baldessari
tripleo           Fix Released  Medium      Michele Baldessari  rocky-3

Bug Description

Sometimes setting up a pcmk remote can fail (when using a single controller node) because the authkey might already have been removed by pcsd:
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:43 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.0354
    overcloud-controller-0.redhat.local - - [24/May/2018:09:39:43 UTC] "POST /remote/auth HTTP/1.1" 200 36 - -> /remote/auth
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2376
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2377
    overcloud-controller-0.redhat.local - - [24/May/2018:09:39:45 UTC] "POST /remote/cluster_stop HTTP/1.1" 200 32 - -> /remote/cluster_stop
    I, [2018-05-24T09:40:39.075982 #19327] INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
    I, [2018-05-24T09:40:39.076036 #19327] INFO -- : CIB USER: hacluster, groups:
    D, [2018-05-24T09:40:39.079019 #19327] DEBUG -- : []
    D, [2018-05-24T09:40:39.079073 #19327] DEBUG -- : ["Failed to initialize the cmap API. Error CS_ERR_LIBRARY\n"]
    D, [2018-05-24T09:40:39.079122 #19327] DEBUG -- : Duration: 0.002975586s
    I, [2018-05-24T09:40:39.079165 #19327] INFO -- : Return Value: 1
    W, [2018-05-24T09:40:39.079248 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
    W, [2018-05-24T09:40:39.079293 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
    D, [2018-05-24T09:40:39.079575 #19327] DEBUG -- : permission check action=full username=hacluster groups=
    D, [2018-05-24T09:40:39.079598 #19327] DEBUG -- : permission granted for superuser
    I, [2018-05-24T09:40:39.079629 #19327] INFO -- : Running: /usr/sbin/pcs cluster destroy
    I, [2018-05-24T09:40:39.079644 #19327] INFO -- : CIB USER: hacluster, groups:
    D, [2018-05-24T09:40:39.874478 #19327] DEBUG -- : ["Shutting down pacemaker/corosync services...\n", "Killing any remaining services...\n", "Removing all cluster configuration files...\n"]

"Removing all cluster configuration files" is what removes the authkey

Changed in puppet-pacemaker:
assignee: nobody → Michele Baldessari (michele)
Changed in puppet-pacemaker:
status: New → In Progress
Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
importance: Undecided → Medium
status: New → Triaged
milestone: none → rocky-2
Changed in tripleo:
status: Triaged → In Progress
Michele Baldessari (michele) wrote :

https://review.openstack.org/569565 (puppet pacemaker authkey issue)
https://review.openstack.org/569578 (wait for remotes to be fully up)

Changed in tripleo:
milestone: rocky-2 → rocky-3
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-pacemaker (master)

Reviewed: https://review.openstack.org/569565
Committed: https://git.openstack.org/cgit/openstack/puppet-pacemaker/commit/?id=8695d6f0f41afac0cb4188f65ffb56de3438734a
Submitter: Zuul
Branch: master

commit 8695d6f0f41afac0cb4188f65ffb56de3438734a
Author: Michele Baldessari <email address hidden>
Date: Sat May 19 12:13:53 2018 +0200

    Fix a small race window when setting up pcmk remotes

    Right now we have the following constraint around the authkey used
    to create remotes:
    File['etc-pacemaker-authkey'] -> Exec["Create Cluster ${cluster_name}"]

    While this is okay in general, we have observed a race from time to time
    (but only when using 1 controller). The reason is the following:
    1) We call the 'pcs cluster auth' command:
    May 24 05:39:43 overcloud-controller-0 puppet-user[18207]: (/Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]) Triggered 'refresh' from 2 events

    2) We generate the authkey
    May 24 05:39:43 overcloud-controller-0 puppet-user[18207]: (/Stage[main]/Pacemaker::Corosync/File[etc-pacemaker-authkey]/ensure) defined content as '{md5}4a0c7c34ecef8b47616dc8ff675859cf'

    3) We call pcs cluster setup

    So either 1), 3), or the pcsd service restart will somehow forcefully trigger
    a /remote/cluster_stop API call, which in turn triggers the removal of all
    existing cluster configuration:
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:43 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.0354
    overcloud-controller-0.redhat.local - - [24/May/2018:09:39:43 UTC] "POST /remote/auth HTTP/1.1" 200 36 - -> /remote/auth
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2376
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2377
    overcloud-controller-0.redhat.local - - [24/May/2018:09:39:45 UTC] "POST /remote/cluster_stop HTTP/1.1" 200 32 - -> /remote/cluster_stop
    I, [2018-05-24T09:40:39.075982 #19327] INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
    I, [2018-05-24T09:40:39.076036 #19327] INFO -- : CIB USER: hacluster, groups:
    D, [2018-05-24T09:40:39.079019 #19327] DEBUG -- : []
    D, [2018-05-24T09:40:39.079073 #19327] DEBUG -- : ["Failed to initialize the cmap API. Error CS_ERR_LIBRARY\n"]
    D, [2018-05-24T09:40:39.079122 #19327] DEBUG -- : Duration: 0.002975586s
    I, [2018-05-24T09:40:39.079165 #19327] INFO -- : Return Value: 1
    W, [2018-05-24T09:40:39.079248 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
    W, [2018-05-24T09:40:39.079293 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
    D, [2018-05-24T09:40:39.079575 #19327] DEBUG -- : permission check action=full username=hacluster groups=
    D, [2018-05-24T09:40:39.079598 #19327] DEBUG -- : permission granted for superuser
    I, [2018-05-24T09:40:39.079629 #19327] INFO -- : Running: /usr/sbin/pcs cluster destroy
    I, [2018-05-24T09:40:39.079644 #19327] INFO -- : CIB USER: hacluster, groups:
    D, [2018-05-...

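One way to close this window, sketched in Puppet, is to reverse the dependency so the authkey is written only after the cluster has been created, i.e. after any 'pcs cluster destroy' has already run. This illustrates the idea only; see commit 8695d6f0 above for the actual change.

    # Reversed ordering: "Removing all cluster configuration files..."
    # now runs before the key exists, so it can no longer delete a
    # freshly written authkey.
    Exec["Create Cluster ${cluster_name}"] -> File['etc-pacemaker-authkey']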

Changed in puppet-pacemaker:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/569578
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5968aeb3207b4c42cc65eb7a7d3a831c4c28d456
Submitter: Zuul
Branch: master

commit 5968aeb3207b4c42cc65eb7a7d3a831c4c28d456
Author: Michele Baldessari <email address hidden>
Date: Sat May 19 16:55:35 2018 +0200

    Make sure remotes are fully up before proceeding

    We currently rely on 'verify_on_create => true' to make
    sure that pacemaker remotes are up before proceeding to Step 2
    (during which a remote node is entitled to run pcs commands).
    So if the remote is still not fully up, pcs commands can
    potentially fail on the remote nodes with errors like:

    Error: /Stage[main]/Tripleo::Profile::Pacemaker::Compute_instanceha
           /Pacemaker::Property[compute-instanceha-role-node-property]
           /Pcmk_property[property-overcloud-novacomputeiha-0-compute-instanceha-role]:
    Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster cib
    /var/lib/pacemaker/cib/puppet-cib-backup20180519-20162-ekt31x failed with code: 1 ->

    'verify_on_create => true' currently has incorrect semantics,
    as it does not really wait for a single resource to be fully up.
    Since implementing that properly will take quite a bit of work
    (given that pcs does not currently support single-resource state
    polling), for now we avoid using verify_on_create and we simply make
    sure the resource is started via an exec.

    Ran 25 successful deployments with this (and the depends-on) patch.

    Closes-Bug: #1773754
    Depends-On: I74994a7e52a7470ead7862dd9083074f807f7675
    Change-Id: I9e5d5bb48fc7393df71d8b9eae200ad4ebaa6aa6
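
The exec-based wait described in this commit can be sketched as follows. The remote node name, the grep pattern, and the retry counts are illustrative assumptions, not the literal puppet-tripleo code:

    # Poll 'pcs status' until the remote reports online, retrying for up
    # to 5 minutes, before later steps run pcs commands against it.
    exec { 'wait-for-remote-overcloud-novacomputeiha-0':
      command   => 'pcs status | grep -q "RemoteOnline:.*overcloud-novacomputeiha-0"',
      path      => ['/usr/sbin', '/usr/bin', '/bin'],
      provider  => shell,   # a shell is needed for the pipe
      tries     => 60,
      try_sleep => 5,
    }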

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/581017

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.openstack.org/581017
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=d5f54dd4170e72a1a50106e1ff82f990e6bf11a5
Submitter: Zuul
Branch: stable/queens

commit d5f54dd4170e72a1a50106e1ff82f990e6bf11a5
Author: Michele Baldessari <email address hidden>
Date: Sat May 19 16:55:35 2018 +0200

    Make sure remotes are fully up before proceeding

    We currently rely on 'verify_on_create => true' to make
    sure that pacemaker remotes are up before proceeding to Step 2
    (during which a remote node is entitled to run pcs commands).
    So if the remote is still not fully up, pcs commands can
    potentially fail on the remote nodes with errors like:

    Error: /Stage[main]/Tripleo::Profile::Pacemaker::Compute_instanceha
           /Pacemaker::Property[compute-instanceha-role-node-property]
           /Pcmk_property[property-overcloud-novacomputeiha-0-compute-instanceha-role]:
    Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster cib
    /var/lib/pacemaker/cib/puppet-cib-backup20180519-20162-ekt31x failed with code: 1 ->

    'verify_on_create => true' currently has incorrect semantics,
    as it does not really wait for a single resource to be fully up.
    Since implementing that properly will take quite a bit of work
    (given that pcs does not currently support single-resource state
    polling), for now we avoid using verify_on_create and we simply make
    sure the resource is started via an exec.

    Ran 25 successful deployments with this (and the depends-on) patch.

    Closes-Bug: #1773754
    Depends-On: I74994a7e52a7470ead7862dd9083074f807f7675
    Change-Id: I9e5d5bb48fc7393df71d8b9eae200ad4ebaa6aa6
    (cherry picked from commit 5968aeb3207b4c42cc65eb7a7d3a831c4c28d456)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.2.0

This issue was fixed in the openstack/puppet-tripleo 9.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.3.5

This issue was fixed in the openstack/puppet-tripleo 8.3.5 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-pacemaker 0.7.2

This issue was fixed in the openstack/puppet-pacemaker 0.7.2 release.
