newton/ha: tripleo cluster fails to build

Bug #1660331 reported by Emilien Macchi on 2017-01-30
This bug affects 1 person
Affects: tripleo
Importance: High
Assigned to: Unassigned

Bug Description

It happens on Newton CI jobs for HA scenario:
http://logs.openstack.org/16/426716/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-ha-newton/bd89ac8/logs/postci.txt.gz#_2017-01-30_12_52_18_000

2017-01-30 12:52:18.000 | Error: /sbin/pcs cluster setup --name tripleo_cluster controller-0-tripleo-ci-a-foo controller-1-tripleo-ci-b-bar controller-2-tripleo-ci-c-baz --token 10000 returned 1 instead of one of [0]
2017-01-30 12:52:18.000 | Error: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: change from notrun to 0 failed: /sbin/pcs cluster setup --name tripleo_cluster controller-0-tripleo-ci-a-foo controller-1-tripleo-ci-b-bar controller-2-tripleo-ci-c-baz --token 10000 returned 1 instead of one of [0]

Tags: ci
Emilien Macchi (emilienm) wrote :

https://www.diffchecker.com/xa89o6Nm is the packaging diff between failing & working jobs. I see nothing related to Pacemaker. It may be a random failure; let's see if it's consistent.

Michele Baldessari (michele) wrote :

So this seems to happen rather rarely (I checked some other ovb-ha logs and have not yet seen another occurrence), but it does look like a real problem/race.
What seems to happen is the following (note that the 12:46:39 timestamp is slightly misleading: the action happened a little earlier, but os-collect-config logs everything in one big chunk):
Jan 30 12:46:39 localhost os-collect-config: Notice: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: Triggered 'refresh' from 2 events
Jan 30 12:46:39 localhost os-collect-config: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: Error: unable to destroy cluster
Jan 30 12:46:39 localhost os-collect-config: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: controller-0-tripleo-ci-a-foo: Unable to authenticate to controller-0-tripleo-ci-a-foo - (HTTP error: 401), try running 'pcs cluster auth'

So basically when puppet-pacemaker does this:
    ->
    exec {"Create Cluster ${cluster_name}":
      creates => '/etc/cluster/cluster.conf',
      command => "${::pacemaker::pcs_bin} cluster setup --name ${cluster_name} ${cluster_members_rrp_real} ${cluster_setup_extras_real}",
      unless => '/usr/bin/test -f /etc/corosync/corosync.conf',
      require => Class['::pacemaker::install'],
    }

pcs actually calls destroy_cluster first (in case a cluster existed before), but it gets a 401 from the controller-0 node:
Notice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]/returns: controller-0-tripleo-ci-a-foo: Unable to authenticate to controller-0-tripleo-ci-a-foo - (HTTP error: 401), try running 'pcs cluster auth'

The corresponding pcsd log shows the following:
I, [2017-01-30T12:46:36.570051 #28804] INFO -- : Return Value: 0
I, [2017-01-30T12:46:36.570128 #28804] INFO -- : Successful login by 'hacluster'
::ffff:172.17.0.253 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1176
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1145
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1147
controller-0-tripleo-ci-a-foo.localdomain - - [30/Jan/2017:12:46:36 UTC] "POST /remote/auth HTTP/1.1" 200 36
::ffff:172.17.0.253 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1188
controller-2-tripleo-ci-c-baz.localdomain - - [30/Jan/2017:12:46:36 UTC] "POST /remote/auth HTTP/1.1" 200 36
- -> /remote/auth
- -> /remote/auth
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0042
::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0044
controller-0-tripleo-ci-a-foo.localdomain - - [30/Jan/2017:12:46:36 UTC] "GET /remote/cluster_destroy HTTP/1.1" 401 24
- -> /remote/cluster_destroy

So even though we correctly did an auth ...
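
To spot the same pattern in other pcsd logs, here is a rough Python sketch that flags any request answered with 401 after the same client had already received a 200 from /remote/auth. The function name, the regular expression, and the idea of pairing requests by client address are all assumptions made for illustration; they are not part of pcs or pcsd, and a hit only shows the suspicious ordering, not its cause.

```python
import re

# Matches the access-log style lines quoted above, e.g.
# ::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1145
# The pattern is an assumption derived from those samples only.
ACCESS_RE = re.compile(
    r'^(?P<client>\S+) - - \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})\b'
)


def find_rejected_after_auth(lines):
    """Return (client, path, timestamp) for each 401 that follows a
    successful /remote/auth from the same client (hypothetical helper)."""
    authed = set()
    hits = []
    for line in lines:
        m = ACCESS_RE.match(line.strip())
        if not m:
            continue  # skip lines such as "- -> /remote/auth"
        client, path, status = m["client"], m["path"], int(m["status"])
        if path == "/remote/auth" and status == 200:
            authed.add(client)
        elif status == 401 and client in authed:
            hits.append((client, path, m["ts"]))
    return hits


# Sample lines copied from the pcsd log excerpt in this report.
SAMPLE = [
    '::ffff:172.17.0.253 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1176',
    '::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "POST /remote/auth HTTP/1.1" 200 36 0.1145',
    '- -> /remote/auth',
    '::ffff:172.17.0.251 - - [30/Jan/2017:12:46:36 +0000] "GET /remote/cluster_destroy HTTP/1.1" 401 24 0.0042',
]

if __name__ == "__main__":
    for client, path, ts in find_rejected_after_auth(SAMPLE):
        print(f"{ts}: {client} got 401 on {path} after a successful auth")
```

Because the excerpt logs some requests twice (once by address, once by hostname), a hit may be reported under both forms of the client name.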


tags: removed: alert
Changed in tripleo:
milestone: ocata-rc1 → ocata-rc2
Changed in tripleo:
status: Triaged → Fix Released