pcmk remote setup code has a race

Bug #1773754 reported by Michele Baldessari
This bug affects 1 person
Affects           Status        Importance  Assigned to         Milestone
puppet-pacemaker  Fix Released  Undecided   Michele Baldessari
tripleo           Fix Released  Medium      Michele Baldessari  rocky-3

Bug Description

Sometimes setting up a pcmk remote can fail (when using a single controller node) because the authkey might already have been removed by pcsd:
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:43 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.0354
    overcloud-controller-0.redhat.local - - [24/May/2018:09:39:43 UTC] "POST /remote/auth HTTP/1.1" 200 36 - -> /remote/auth
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2376
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2377
    overcloud-controller-0.redhat.local - - [24/May/2018:09:39:45 UTC] "POST /remote/cluster_stop HTTP/1.1" 200 32 - -> /remote/cluster_stop
    I, [2018-05-24T09:40:39.075982 #19327] INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
    I, [2018-05-24T09:40:39.076036 #19327] INFO -- : CIB USER: hacluster, groups:
    D, [2018-05-24T09:40:39.079019 #19327] DEBUG -- : []
    D, [2018-05-24T09:40:39.079073 #19327] DEBUG -- : ["Failed to initialize the cmap API. Error CS_ERR_LIBRARY\n"]
    D, [2018-05-24T09:40:39.079122 #19327] DEBUG -- : Duration: 0.002975586s
    I, [2018-05-24T09:40:39.079165 #19327] INFO -- : Return Value: 1
    W, [2018-05-24T09:40:39.079248 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
    W, [2018-05-24T09:40:39.079293 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
    D, [2018-05-24T09:40:39.079575 #19327] DEBUG -- : permission check action=full username=hacluster groups=
    D, [2018-05-24T09:40:39.079598 #19327] DEBUG -- : permission granted for superuser
    I, [2018-05-24T09:40:39.079629 #19327] INFO -- : Running: /usr/sbin/pcs cluster destroy
    I, [2018-05-24T09:40:39.079644 #19327] INFO -- : CIB USER: hacluster, groups:
    D, [2018-05-24T09:40:39.874478 #19327] DEBUG -- : ["Shutting down pacemaker/corosync services...\n", "Killing any remaining services...\n", "Removing all cluster configuration files...\n"]

"Removing all cluster configuration files" is what removes the authkey

Changed in puppet-pacemaker:
assignee: nobody → Michele Baldessari (michele)
Changed in puppet-pacemaker:
status: New → In Progress
Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
importance: Undecided → Medium
status: New → Triaged
milestone: none → rocky-2
Changed in tripleo:
status: Triaged → In Progress
Michele Baldessari (michele) wrote :

https://review.openstack.org/569565 (puppet pacemaker authkey issue)
https://review.openstack.org/569578 (wait for remotes to be fully up)

Changed in tripleo:
milestone: rocky-2 → rocky-3
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-pacemaker (master)

Reviewed: https://review.openstack.org/569565
Committed: https://git.openstack.org/cgit/openstack/puppet-pacemaker/commit/?id=8695d6f0f41afac0cb4188f65ffb56de3438734a
Submitter: Zuul
Branch: master

commit 8695d6f0f41afac0cb4188f65ffb56de3438734a
Author: Michele Baldessari <email address hidden>
Date: Sat May 19 12:13:53 2018 +0200

    Fix a small race window when setting up pcmk remotes

    Right now we have the following constraint around the authkey used
    to create remotes:
    File['etc-pacemaker-authkey'] -> Exec["Create Cluster ${cluster_name}"]

    While this is okay in general, we have observed a race from time to time
    (but only when using 1 controller). The reason is the following:
    1) We call the 'pcs cluster auth' command:
    May 24 05:39:43 overcloud-controller-0 puppet-user[18207]: (/Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]) Triggered 'refresh' from 2 events

    2) We generate the authkey
    May 24 05:39:43 overcloud-controller-0 puppet-user[18207]: (/Stage[main]/Pacemaker::Corosync/File[etc-pacemaker-authkey]/ensure) defined content as '{md5}4a0c7c34ecef8b47616dc8ff675859cf'

    3) We call pcs cluster setup

    So either 1), 3), or the pcsd service restart will somehow forcefully trigger
    a /remote/cluster_stop API call, which in turn triggers the removal of all
    existing cluster configuration:
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:43 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.0354
    overcloud-controller-0.redhat.local - - [24/May/2018:09:39:43 UTC] "POST /remote/auth HTTP/1.1" 200 36 - -> /remote/auth
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2376
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2377
    overcloud-controller-0.redhat.local - - [24/May/2018:09:39:45 UTC] "POST /remote/cluster_stop HTTP/1.1" 200 32 - -> /remote/cluster_stop
    I, [2018-05-24T09:40:39.075982 #19327] INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
    I, [2018-05-24T09:40:39.076036 #19327] INFO -- : CIB USER: hacluster, groups:
    D, [2018-05-24T09:40:39.079019 #19327] DEBUG -- : []
    D, [2018-05-24T09:40:39.079073 #19327] DEBUG -- : ["Failed to initialize the cmap API. Error CS_ERR_LIBRARY\n"]
    D, [2018-05-24T09:40:39.079122 #19327] DEBUG -- : Duration: 0.002975586s
    I, [2018-05-24T09:40:39.079165 #19327] INFO -- : Return Value: 1
    W, [2018-05-24T09:40:39.079248 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
    W, [2018-05-24T09:40:39.079293 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
    D, [2018-05-24T09:40:39.079575 #19327] DEBUG -- : permission check action=full username=hacluster groups=
    D, [2018-05-24T09:40:39.079598 #19327] DEBUG -- : permission granted for superuser
    I, [2018-05-24T09:40:39.079629 #19327] INFO -- : Running: /usr/sbin/pcs cluster destroy
    I, [2018-05-24T09:40:39.079644 #19327] INFO -- : CIB USER: hacluster, groups:
    D, [2018-05-...

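One way to close this window, sketched in Puppet, is to reverse the dependency so the authkey is written only after the cluster has been created, i.e. after any 'pcs cluster destroy' has already run. This illustrates the idea only; see commit 8695d6f0 above for the actual change.

    # Reversed ordering: "Removing all cluster configuration files..."
    # now runs before the key exists, so it can no longer delete a
    # freshly written authkey.
    Exec["Create Cluster ${cluster_name}"] -> File['etc-pacemaker-authkey']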

Changed in puppet-pacemaker:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.openstack.org/569578
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=5968aeb3207b4c42cc65eb7a7d3a831c4c28d456
Submitter: Zuul
Branch: master

commit 5968aeb3207b4c42cc65eb7a7d3a831c4c28d456
Author: Michele Baldessari <email address hidden>
Date: Sat May 19 16:55:35 2018 +0200

    Make sure remotes are fully up before proceeding

    We currently rely on 'verify_on_create => true' to make
    sure that pacemaker remotes are up before proceeding to Step 2
    (during which a remote node is entitled to run pcs commands).
    So if the remote is still not fully up, pcs commands can
    potentially fail on the remote nodes with errors like:

    Error: /Stage[main]/Tripleo::Profile::Pacemaker::Compute_instanceha
           /Pacemaker::Property[compute-instanceha-role-node-property]
           /Pcmk_property[property-overcloud-novacomputeiha-0-compute-instanceha-role]:
    Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster cib
    /var/lib/pacemaker/cib/puppet-cib-backup20180519-20162-ekt31x failed with code: 1 ->

    'verify_on_create => true' currently has incorrect semantics,
    as it does not really wait for a single resource to be fully up.
    Since implementing that properly will take quite a bit of work
    (given that pcs does not currently support single-resource state
    polling), for now we avoid using verify_on_create and we simply make
    sure the resource is started via an exec.

    Ran 25 successful deployments with this (and the depends-on) patch.

    Closes-Bug: #1773754
    Depends-On: I74994a7e52a7470ead7862dd9083074f807f7675
    Change-Id: I9e5d5bb48fc7393df71d8b9eae200ad4ebaa6aa6
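
The exec-based wait described in this commit can be sketched as follows. The remote node name, the grep pattern, and the retry counts are illustrative assumptions, not the literal puppet-tripleo code:

    # Poll 'pcs status' until the remote reports online, retrying for up
    # to 5 minutes, before later steps run pcs commands against it.
    exec { 'wait-for-remote-overcloud-novacomputeiha-0':
      command   => 'pcs status | grep -q "RemoteOnline:.*overcloud-novacomputeiha-0"',
      path      => ['/usr/sbin', '/usr/bin', '/bin'],
      provider  => shell,   # a shell is needed for the pipe
      tries     => 60,
      try_sleep => 5,
    }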

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/581017

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/queens)

Reviewed: https://review.openstack.org/581017
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=d5f54dd4170e72a1a50106e1ff82f990e6bf11a5
Submitter: Zuul
Branch: stable/queens

commit d5f54dd4170e72a1a50106e1ff82f990e6bf11a5
Author: Michele Baldessari <email address hidden>
Date: Sat May 19 16:55:35 2018 +0200

    Make sure remotes are fully up before proceeding

    We currently rely on 'verify_on_create => true' to make
    sure that pacemaker remotes are up before proceeding to Step 2
    (during which a remote node is entitled to run pcs commands).
    So if the remote is still not fully up, pcs commands can
    potentially fail on the remote nodes with errors like:

    Error: /Stage[main]/Tripleo::Profile::Pacemaker::Compute_instanceha
           /Pacemaker::Property[compute-instanceha-role-node-property]
           /Pcmk_property[property-overcloud-novacomputeiha-0-compute-instanceha-role]:
    Could not evaluate: backup_cib: Running: /usr/sbin/pcs cluster cib
    /var/lib/pacemaker/cib/puppet-cib-backup20180519-20162-ekt31x failed with code: 1 ->

    'verify_on_create => true' currently has incorrect semantics,
    as it does not really wait for a single resource to be fully up.
    Since implementing that properly will take quite a bit of work
    (given that pcs does not currently support single-resource state
    polling), for now we avoid using verify_on_create and we simply make
    sure the resource is started via an exec.

    Ran 25 successful deployments with this (and the depends-on) patch.

    Closes-Bug: #1773754
    Depends-On: I74994a7e52a7470ead7862dd9083074f807f7675
    Change-Id: I9e5d5bb48fc7393df71d8b9eae200ad4ebaa6aa6
    (cherry picked from commit 5968aeb3207b4c42cc65eb7a7d3a831c4c28d456)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 9.2.0

This issue was fixed in the openstack/puppet-tripleo 9.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 8.3.5

This issue was fixed in the openstack/puppet-tripleo 8.3.5 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-pacemaker 0.7.2

This issue was fixed in the openstack/puppet-pacemaker 0.7.2 release.
