Periodic jobs in queens are failing because of problems in pacemaker

Bug #1875890 reported by Amol Kahat
This bug affects 4 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Amol Kahat
Milestone: (none)

Bug Description

Description
===========

I found that corosync is not configured:
- https://logserver.rdoproject.org/openstack-periodic-wednesday-weekend/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-queens/cca03e3/logs/subnode-1/etc/corosync/

This leads to another failure: the pcs cluster is not configured.
- https://logserver.rdoproject.org/openstack-periodic-wednesday-weekend/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-queens/cca03e3/logs/subnode-1/var/log/extra/pcs.txt.gz

Pacemaker is working but not configured correctly.

$ pcs.txt
+ pcs status
Error: cluster is not currently running on this node
+ pcs config
Error: unable to get crm_config
Signon to CIB failed: Transport endpoint is not connected
Init failed, could not perform requested operations

Error: error running crm_mon, is pacemaker running?
Cluster Name:

Revision history for this message
Michele Baldessari (michele) wrote :

So we see that:
1) pcsd gets started by puppet running on the host:
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Kernel/Exec[rebuild initramfs]) Triggered 'refresh' from 37 events
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Kernel/Sysctl::Value[net.nf_conntrack_max]/Sysctl_runtime[net.nf_conntrack_max]/val) val changed '262144' to '500000'
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Pacemaker/Systemd::Unit_file[docker.service]/File[/etc/systemd/system/resource-agents-deps.target.wants/docker.service]/ensure) created
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Pacemaker/Systemd::Unit_file[rhel-push-plugin.service]/File[/etc/systemd/system/resource-agents-deps.target.wants/rhel-push-plugin.service]/ensure) created
Apr 29 09:52:41 node-0000072019 usermod[23255]: change user 'hacluster' password
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Pacemaker::Corosync/User[hacluster]/password) changed password
Apr 29 09:52:41 node-0000072019 usermod[23262]: add 'hacluster' to group 'haclient'
Apr 29 09:52:41 node-0000072019 usermod[23262]: add 'hacluster' to shadow group 'haclient'
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Pacemaker::Corosync/User[hacluster]/groups) groups changed '' to ['haclient']
Apr 29 09:52:42 node-0000072019 systemd[1]: Reloading.
Apr 29 09:52:42 node-0000072019 systemd[1]: Starting PCS GUI and remote configuration interface...
Apr 29 09:52:42 node-0000072019 systemd[1]: Started PCS GUI and remote configuration interface.
Apr 29 09:52:43 node-0000072019 puppet-user[16055]: (/Stage[main]/Pacemaker::Service/Service[pcsd]/ensure) ensure changed 'stopped' to 'running'

2) After the pcsd service started, nothing else appears in the logs at all
3) What we do know is that puppet was still running because in pstree we have:
           | |-sshd(8926)---sshd(8929)---sh(16041)---sudo(16050)---sh(16052)---python(16053)---python(16054)---puppet(160+

So likely puppet is hanging *somewhere*.

From the pcsd logs we see the following last messages:
I, [2020-04-29T10:52:44.742198 #23294] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2020-04-29T10:52:44.742248 #23294] INFO -- : CIB USER: hacluster, groups:
I, [2020-04-29T10:52:45.097004 #23294] INFO -- : Return Value: 1
I, [2020-04-29T10:52:45.097178 #23294] INFO -- : Config files sync skipped, this host does not seem to be in a cluster of at least 2 nodes

I.e. after the pcsd service started, it seems unlikely that we ran any new commands at all, because pcsd normally logs most of the commands that are run via pcs.

So puppet is busy doing *something*, although even from the top.log I cannot see any interesting puppet child processes that could be hanging:
  16053 root 20 0 88380 8180 3664 S 0.0 0.1 0:00.04 python
  16054 root 20 0 103584 11128 3896 S 0.0 0.1 0:00.74 python
  16055 root 20 0 410488 107196 14864 S 0.0 1.3 0:09.10 puppet

In terms of packages we have:
pacemaker-1.1.21-4.el7.x86_...


Revision history for this message
Michele Baldessari (michele) wrote :

Ok so here is the root cause:
puppet-pacemaker-0.7.3-0.20190930085227.447cef0.el7.noarch has an older, broken match regex to determine when it is running on rhel8; from https://github.com/openstack/puppet-pacemaker/blob/447cef0ad8891e89495fef7a9be5e2152b07edfa/manifests/params.pp we see it has:
      if $::operatingsystemrelease =~ /8\..*$/ {
        $pcs_010 = true
      } else {
        $pcs_010 = false
      }

That code will match on rhel 7.8, and so we basically kick off the codepaths that are meant for CentOS 8 and pcs 0.10, as opposed to pcs 0.9.x and CentOS 7.
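
To illustrate the false match, here is a minimal stand-alone Puppet sketch (it is not part of puppet-pacemaker; the release string '7.8.2003' is an assumed example of what facter reports for $::operatingsystemrelease on CentOS 7.8). Because =~ does an unanchored search, the pattern finds a match starting at the '8.' substring; an anchored variant is shown only for comparison and is not necessarily the upstream fix:

      # Assumed example value of $::operatingsystemrelease on CentOS 7.8
      $release = '7.8.2003'

      # Broken check: the pattern is unanchored, so it matches the trailing '8.2003'
      if $release =~ /8\..*$/ {
        notice("unanchored pattern matches ${release}: pcs 0.10 codepath wrongly selected")
      } else {
        notice("unanchored pattern does not match ${release}: pcs 0.9.x codepath")
      }

      # Anchored variant for comparison: does not match a 7.x release string
      if $release =~ /^8\..*$/ {
        notice("anchored pattern matches ${release}")
      } else {
        notice("anchored pattern does not match ${release}")
      }

Applying that manifest with 'puppet apply' should log the "wrongly selected" and "does not match" notices, i.e. the pcs 0.10 branch gets taken even though the node runs pcs 0.9.x.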

What I am not 100% sure about is why we did not eventually see an error message in the logs and instead timed out, because once I ran puppet --debug by hand on the env Panda gave me, I saw the errors and it failed after 10 tries.

In any case the fix here is simply to rebase puppet-pacemaker in our releases and be done with it. This has been fixed in master and we do not really use branches for puppet-pacemaker.

I now tested a queens deploy on rhel 7.8 with puppet-pacemaker from master and it worked okay, so I think this should be pretty safe.

I will now test a deploy of queens+7.8 with stock "old" puppet-pacemaker and then redeploy against that with an updated puppet-pacemaker to make sure there are no surprises. If that is okay I will post a review to rdoinfo to move the pins to a puppet-pacemaker that has the operating system fix.

Revision history for this message
Michele Baldessari (michele) wrote :
wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released