Comment 2 for bug 1773754

OpenStack Infra (hudson-openstack) wrote: Fix merged to puppet-pacemaker (master)

Reviewed: https://review.openstack.org/569565
Committed: https://git.openstack.org/cgit/openstack/puppet-pacemaker/commit/?id=8695d6f0f41afac0cb4188f65ffb56de3438734a
Submitter: Zuul
Branch: master

commit 8695d6f0f41afac0cb4188f65ffb56de3438734a
Author: Michele Baldessari <email address hidden>
Date: Sat May 19 12:13:53 2018 +0200

    Fix a small race window when setting up pcmk remotes

    Right now we have the following constraint around the authkey used
    to create remotes:
    File['etc-pacemaker-authkey'] -> Exec["Create Cluster ${cluster_name}"]

    While this is okay in general, we have observed a race from time to time
    (but only when using 1 controller). The reason is the following:
    1) We call the 'pcs cluster auth' command:
    May 24 05:39:43 overcloud-controller-0 puppet-user[18207]: (/Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]) Triggered 'refresh' from 2 events

    2) We generate the authkey
    May 24 05:39:43 overcloud-controller-0 puppet-user[18207]: (/Stage[main]/Pacemaker::Corosync/File[etc-pacemaker-authkey]/ensure) defined content as '{md5}4a0c7c34ecef8b47616dc8ff675859cf'

    3) We call pcs cluster setup

    So either 1), 3), or the pcsd service restart somehow forcefully triggers
    a /remote/cluster_stop API call, which in turn triggers the removal of all
    existing cluster configuration:
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:43 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.0354
    overcloud-controller-0.redhat.local - - [24/May/2018:09:39:43 UTC] "POST /remote/auth HTTP/1.1" 200 36 - -> /remote/auth
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2376
    ::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2377
    overcloud-controller-0.redhat.local - - [24/May/2018:09:39:45 UTC] "POST /remote/cluster_stop HTTP/1.1" 200 32 - -> /remote/cluster_stop
    I, [2018-05-24T09:40:39.075982 #19327] INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
    I, [2018-05-24T09:40:39.076036 #19327] INFO -- : CIB USER: hacluster, groups:
    D, [2018-05-24T09:40:39.079019 #19327] DEBUG -- : []
    D, [2018-05-24T09:40:39.079073 #19327] DEBUG -- : ["Failed to initialize the cmap API. Error CS_ERR_LIBRARY\n"]
    D, [2018-05-24T09:40:39.079122 #19327] DEBUG -- : Duration: 0.002975586s
    I, [2018-05-24T09:40:39.079165 #19327] INFO -- : Return Value: 1
    W, [2018-05-24T09:40:39.079248 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file
    W, [2018-05-24T09:40:39.079293 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
    D, [2018-05-24T09:40:39.079575 #19327] DEBUG -- : permission check action=full username=hacluster groups=
    D, [2018-05-24T09:40:39.079598 #19327] DEBUG -- : permission granted for superuser
    I, [2018-05-24T09:40:39.079629 #19327] INFO -- : Running: /usr/sbin/pcs cluster destroy
    I, [2018-05-24T09:40:39.079644 #19327] INFO -- : CIB USER: hacluster, groups:
    D, [2018-05-24T09:40:39.874478 #19327] DEBUG -- : ["Shutting down pacemaker/corosync services...\n", "Killing any remaining services...\n", "Removing all cluster configuration files...\n"]

    The above removal of all cluster configuration is problematic because it
    will also remove the puppet-generated /etc/pacemaker/authkey, and the
    cluster setup command will generate a new random authkey for us. But
    since this key is now different from what we have on the pcmk remote
    nodes, the cluster will fail with:

    May 19 07:22:01 [15997] overcloud-novacomputeiha-0 pacemaker_remoted: error: lrmd_remote_client_msg: Remote lrmd tls handshake failed (-24)
    (-24 is the error code for decryption failed, i.e. differing keys).

    Tested this with 25 successful runs in a row (usually it would fail
    once every 5-6 deploys). We now simply make sure that the authkey is
    generated after the cluster setup but before the cluster is started.
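
    A minimal sketch of that new ordering in Puppet (the cluster name, node
    list, exec commands and authkey source below are illustrative assumptions,
    not the exact resources used by the module):

        $cluster_name = 'mycluster'                      # hypothetical example value
        $authkey      = 'replace-with-a-shared-secret'   # assumed to be supplied by the caller

        # Cluster setup; its side effects (e.g. a cluster_stop/destroy) may
        # remove existing cluster configuration files.
        exec { "Create Cluster ${cluster_name}":
          command => "/usr/sbin/pcs cluster setup --name ${cluster_name} overcloud-controller-0",
          creates => '/etc/corosync/corosync.conf',
        }

        # The authkey is written only after setup, so a destroy triggered
        # during setup can no longer wipe it.
        file { 'etc-pacemaker-authkey':
          path    => '/etc/pacemaker/authkey',
          content => $authkey,
          owner   => 'hacluster',
          group   => 'haclient',
          mode    => '0600',
        }

        # The cluster is only started once the authkey is in place, so the
        # pcmk remotes see the same key as the cluster nodes.
        exec { "Start Cluster ${cluster_name}":
          command => '/usr/sbin/pcs cluster start --all',
          unless  => '/usr/sbin/pcs status',
        }

        Exec["Create Cluster ${cluster_name}"]
        -> File['etc-pacemaker-authkey']
        -> Exec["Start Cluster ${cluster_name}"]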

    Closes-Bug: #1773754

    Change-Id: I74994a7e52a7470ead7862dd9083074f807f7675