commit 8695d6f0f41afac0cb4188f65ffb56de3438734a
Author: Michele Baldessari <email address hidden>
Date: Sat May 19 12:13:53 2018 +0200
Fix a small race window when setting up pcmk remotes
Right now we have the following constraint around the authkey used
to create remotes:
File['etc-pacemaker-authkey'] -> Exec["Create Cluster ${cluster_name}"]
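For context, a minimal sketch of the two resources behind that constraint (the resource bodies here are illustrative placeholders, not the actual puppet-pacemaker code; only the two resource titles appear in this commit):

```puppet
# Illustrative only: the real resources live in puppet-pacemaker,
# these bodies just show the shape of the ordering constraint.
file { 'etc-pacemaker-authkey':
  path    => '/etc/pacemaker/authkey',
  content => $authkey,
}
exec { "Create Cluster ${cluster_name}":
  command => "/usr/sbin/pcs cluster setup --name ${cluster_name} ${node_string}",
}
File['etc-pacemaker-authkey'] -> Exec["Create Cluster ${cluster_name}"]
```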
While this is okay in general, we have observed a race from time to time
(but only when using 1 controller). The reason is the following:
1) We call the 'pcs cluster auth' command:
May 24 05:39:43 overcloud-controller-0 puppet-user[18207]: (/Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]) Triggered 'refresh' from 2 events
2) We generate the authkey
May 24 05:39:43 overcloud-controller-0 puppet-user[18207]: (/Stage[main]/Pacemaker::Corosync/File[etc-pacemaker-authkey]/ensure) defined content as '{md5}4a0c7c34ecef8b47616dc8ff675859cf'
3) We call pcs cluster setup
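The three steps roughly correspond to the following commands (a sketch: the hostnames, cluster name, and key path are assumptions for illustration; real deployments write the key to /etc/pacemaker/authkey, while this sketch writes to the current directory):

```shell
# Step 1: authenticate pcsd across nodes (pcs 0.9 syntax; hostname illustrative)
# pcs cluster auth overcloud-controller-0 -u hacluster -p <password>

# Step 2: generate the shared authkey (4096 random bytes, the usual
# pacemaker convention; size and path are assumptions here)
dd if=/dev/urandom of=authkey bs=4096 count=1 2>/dev/null

# Step 3: create the cluster (illustrative; normally driven by puppet)
# pcs cluster setup --name tripleo_cluster overcloud-controller-0
```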
So either 1), 3), or the pcsd service restart somehow forcefully triggers
a /remote/cluster_stop API call, which in turn triggers a removal of all existing cluster configuration:
::ffff:172.17.1.22 - - [24/May/2018:09:39:43 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.0354
overcloud-controller-0.redhat.local - - [24/May/2018:09:39:43 UTC] "POST /remote/auth HTTP/1.1" 200 36 - -> /remote/auth
::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2376
::ffff:172.17.1.22 - - [24/May/2018:09:39:45 +0000] "POST /remote/cluster_stop HTTP/1.1" 200 32 0.2377
overcloud-controller-0.redhat.local - - [24/May/2018:09:39:45 UTC] "POST /remote/cluster_stop HTTP/1.1" 200 32
- -> /remote/cluster_stop
I, [2018-05-24T09:40:39.075982 #19327] INFO -- : Running: /usr/sbin/corosync-cmapctl totem.cluster_name
I, [2018-05-24T09:40:39.076036 #19327] INFO -- : CIB USER: hacluster, groups:
D, [2018-05-24T09:40:39.079019 #19327] DEBUG -- : []
D, [2018-05-24T09:40:39.079073 #19327] DEBUG -- : ["Failed to initialize the cmap API. Error CS_ERR_LIBRARY\n"]
D, [2018-05-24T09:40:39.079122 #19327] DEBUG -- : Duration: 0.002975586s
I, [2018-05-24T09:40:39.079165 #19327] INFO -- : Return Value: 1
W, [2018-05-24T09:40:39.079248 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file
W, [2018-05-24T09:40:39.079293 #19327] WARN -- : Cannot read config 'corosync.conf' from '/etc/corosync/corosync.conf': No such file or directory - /etc/corosync/corosync.conf
D, [2018-05-24T09:40:39.079575 #19327] DEBUG -- : permission check action=full username=hacluster groups=
D, [2018-05-24T09:40:39.079598 #19327] DEBUG -- : permission granted for superuser
I, [2018-05-24T09:40:39.079629 #19327] INFO -- : Running: /usr/sbin/pcs cluster destroy
I, [2018-05-24T09:40:39.079644 #19327] INFO -- : CIB USER: hacluster, groups:
D, [2018-05-24T09:40:39.874478 #19327] DEBUG -- : ["Shutting down pacemaker/corosync services...\n", "Killing any remaining services...\n", "Removing all cluster configuration files...\n"]
The above removal of all cluster configuration is problematic because it
also removes the puppet-generated /etc/pacemaker/authkey, and the
cluster setup command then generates a new random authkey for us. But
since this key is now different from the one on the pcmk remote nodes,
the cluster fails with:
May 19 07:22:01 [15997] overcloud-novacomputeiha-0 pacemaker_remoted: error: lrmd_remote_client_msg: Remote lrmd tls handshake failed (-24)
(-24 is the error code for decryption failed, i.e. differing keys).
Tested this with 25 successful runs in a row (previously it would fail
roughly once every 5-6 deploys). We now simply make sure that the
authkey is generated after the cluster setup but before the cluster is
started.
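Expressed as a Puppet ordering chain, the fix looks roughly like this (a sketch: only the first two resource titles appear in this commit; the "Start Cluster" title is an assumption standing in for whatever resource starts the cluster):

```puppet
# Corrected ordering: authkey lands after setup, before start.
# "Start Cluster ..." is a hypothetical title for illustration.
Exec["Create Cluster ${cluster_name}"]
  -> File['etc-pacemaker-authkey']
  -> Exec["Start Cluster ${cluster_name}"]
```

This way the authkey file can no longer be clobbered by the configuration wipe that happens during setup, and it is guaranteed to be in place before the remotes try their TLS handshake.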
Reviewed: https://review.openstack.org/569565
Committed: https://git.openstack.org/cgit/openstack/puppet-pacemaker/commit/?id=8695d6f0f41afac0cb4188f65ffb56de3438734a
Submitter: Zuul
Branch: master
Closes-Bug: #1773754
Change-Id: I74994a7e52a7470ead7862dd9083074f807f7675