So this seems to happen rather rarely (I checked some other ovb-ha logs and have not yet seen another occurrence), but it does look like a real problem/race. What seems to happen is the following. Note that the time 12:46:39 is slightly misleading: the action happened a little earlier, but os-collect-config logs everything in one big chunk.

Jan 30 12:46:39 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: Triggered 'refresh' from 2 events
Jan 30 12:46:39 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: Error: unable to destroy cluster#033[0m
Jan 30 12:46:39 localhost os-collect-config: #033[mNotice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: controller-0-tripleo-ci-a-foo: Unable to authenticate to controller-0-tripleo-ci-a-foo - (HTTP error: 401), try running 'pcs cluster auth'#033[0m

So basically when puppet-pacemaker does this:

->
exec {"Create Cluster ${cluster_name}":
  creates => '/etc/cluster/cluster.conf',
  command => "${::pacemaker::pcs_bin} cluster setup --name ${cluster_name} ${cluster_members_rrp_real} ${cluster_setup_extras_real}",
  unless  => '/usr/bin/test -f /etc/corosync/corosync.conf',
  require => Class['::pacemaker::install'],
}

pcs will actually try to destroy the cluster first (in case it existed before), but it gets a 401 on the controller-0 node:

Notice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: controller-0-tripleo-ci-a-foo: Unable to authenticate to controller-0-tripleo-ci-a-foo - (HTTP error: 401), try running 'pcs cluster auth'

The corresponding pcsd log shows the following:

I, [2017-01-30T12:46:36.570051 #28804] INFO -- : Return Value: 0
I, [2017-01-30T12:46:36.570128 #28804] INFO -- : Successful login by 'hacluster'
::ffff:172.17.0.253 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1176
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1145
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1147
controller-0-tripleo-ci-a-foo.localdomain - - [30/Jan/2017:12:46:36 UTC] "POST /remote/auth HTTP/1.1" 200 36
::ffff:172.17.0.253 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1188
controller-2-tripleo-ci-c-baz.localdomain - - [30/Jan/2017:12:46:36 UTC] "POST /remote/auth HTTP/1.1" 200 36
- -> /remote/auth
- -> /remote/auth
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0042
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0044
controller-0-tripleo-ci-a-foo.localdomain - - [30/Jan/2017:12:46:36 UTC] "GET /remote/cluster_destroy HTTP/1.1" 401 24
- -> /remote/cluster_destroy

So even though we correctly did an auth for controller-0:

controller-0-tripleo-ci-a-foo.localdomain - - [30/Jan/2017:12:46:36 UTC] "POST /remote/auth HTTP/1.1" 200 36

we still returned 401. The code in pcsd which barfed on us is here:

def cluster_destroy(params, request, auth_user)
  if not allowed_for_local_cluster(auth_user, Permissions::FULL)
    return 403, 'Permission denied'
  end
  out, errout, retval = run_cmd(auth_user, PCS, "cluster", "destroy")
  if retval == 0
    return [200, "Successfully destroyed cluster"]
  else
    return [400, "Error destroying cluster:\n#{out}\n#{errout}\n#{retval}\n"]
  end
end

So the 401 cannot come from the explicit returns here (those are 403, 200 and 400); it presumably comes out of run_cmd() or from the authentication check done before this handler is called. The problem is that the line "/Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: Triggered 'refresh' from 2 events" tells us that pcs cluster auth *was* indeed run already, although it would be nice to collect puppet debug logs by default to be sure.

So there are a couple of things to do in order to get to the bottom of an issue like this:
a) enable puppet debug logs by default in CI
b) add retries to the cluster setup operations
c) get to the bottom of this particular occurrence

I think we should start with a), as that will help immensely with issues like this.
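
For b), here is a rough sketch of what "retry on 401" could look like. This is only an illustration of the idea and not code from pcsd or puppet-pacemaker: the helper name with_retries, its parameters, and the canned responses are all made up for the example.

```ruby
# Hypothetical helper, not part of pcsd or puppet-pacemaker.
# Calls the block until it returns a status that is not in `retry_on`,
# or until `tries` attempts have been made. Returns [status, body, attempts].
def with_retries(tries: 3, delay: 1.0, retry_on: [401])
  attempt = 0
  loop do
    attempt += 1
    status, body = yield(attempt)
    # Stop on success, on a non-retryable status, or when out of attempts.
    return [status, body, attempt] unless retry_on.include?(status) && attempt < tries
    sleep(delay)
  end
end

# Stand-in for the remote call (assumption for the demo): the node answers
# 401 twice, as in the race above, and then accepts the request.
responses = [
  [401, 'Unable to authenticate'],
  [401, 'Unable to authenticate'],
  [200, 'Successfully destroyed cluster']
]

status, body, attempts = with_retries(tries: 5, delay: 0) { |n| responses[n - 1] }
puts "#{status} after #{attempts} attempts: #{body}"
```

In the manifest itself the equivalent would presumably be the tries and try_sleep attributes of the exec resource, rather than a loop like this.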