Periodic jobs in queens are failing because of problems in pacemaker

Bug #1875890 reported by Amol Kahat
This bug affects 4 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Amol Kahat
Milestone: (none)

Bug Description

Description
===========

I found that corosync is not configured:
- https://logserver.rdoproject.org/openstack-periodic-wednesday-weekend/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-queens/cca03e3/logs/subnode-1/etc/corosync/

This leads to another failure: the pcs cluster is not configured.
- https://logserver.rdoproject.org/openstack-periodic-wednesday-weekend/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset010-queens/cca03e3/logs/subnode-1/var/log/extra/pcs.txt.gz

Pacemaker is working but not configured correctly.

$ pcs.txt
+ pcs status
Error: cluster is not currently running on this node
+ pcs config
Error: unable to get crm_config
Signon to CIB failed: Transport endpoint is not connected
Init failed, could not perform requested operations

Error: error running crm_mon, is pacemaker running?
Cluster Name:

Revision history for this message
Michele Baldessari (michele) wrote :

So we see that:
1) pcsd gets started by puppet running on the host:
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Kernel/Exec[rebuild initramfs]) Triggered 'refresh' from 37 events
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Kernel/Sysctl::Value[net.nf_conntrack_max]/Sysctl_runtime[net.nf_conntrack_max]/val) val changed '262144' to '500000'
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Pacemaker/Systemd::Unit_file[docker.service]/File[/etc/systemd/system/resource-agents-deps.target.wants/docker.service]/ensure) created
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Pacemaker/Systemd::Unit_file[rhel-push-plugin.service]/File[/etc/systemd/system/resource-agents-deps.target.wants/rhel-push-plugin.service]/ensure) created
Apr 29 09:52:41 node-0000072019 usermod[23255]: change user 'hacluster' password
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Pacemaker::Corosync/User[hacluster]/password) changed password
Apr 29 09:52:41 node-0000072019 usermod[23262]: add 'hacluster' to group 'haclient'
Apr 29 09:52:41 node-0000072019 usermod[23262]: add 'hacluster' to shadow group 'haclient'
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Pacemaker::Corosync/User[hacluster]/groups) groups changed '' to ['haclient']
Apr 29 09:52:42 node-0000072019 systemd[1]: Reloading.
Apr 29 09:52:42 node-0000072019 systemd[1]: Starting PCS GUI and remote configuration interface...
Apr 29 09:52:42 node-0000072019 systemd[1]: Started PCS GUI and remote configuration interface.
Apr 29 09:52:43 node-0000072019 puppet-user[16055]: (/Stage[main]/Pacemaker::Service/Service[pcsd]/ensure) ensure changed 'stopped' to 'running'

2) After the pcsd service started, nothing else appears in the logs at all
3) What we do know is that puppet was still running because in pstree we have:
           | |-sshd(8926)---sshd(8929)---sh(16041)---sudo(16050)---sh(16052)---python(16053)---python(16054)---puppet(160+

So likely puppet is hanging *somewhere*.

From the pcsd logs we see the following last messages:
I, [2020-04-29T10:52:44.742198 #23294] INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2020-04-29T10:52:44.742248 #23294] INFO -- : CIB USER: hacluster, groups:
I, [2020-04-29T10:52:45.097004 #23294] INFO -- : Return Value: 1
I, [2020-04-29T10:52:45.097178 #23294] INFO -- : Config files sync skipped, this host does not seem to be in a cluster of at least 2 nodes

I.e. after the pcsd service started, it seems unlikely that we ran any new commands at all, because pcsd normally logs most of the commands that are run via pcs.

So puppet is busy doing *something*, although even from the top.log I cannot see any interesting puppet child processes that could be hanging:
  16053 root 20 0 88380 8180 3664 S 0.0 0.1 0:00.04 python
  16054 root 20 0 103584 11128 3896 S 0.0 0.1 0:00.74 python
  16055 root 20 0 410488 107196 14864 S 0.0 1.3 0:09.10 puppet

In terms of packages we have:
pacemaker-1.1.21-4.el7.x86_...


Revision history for this message
Michele Baldessari (michele) wrote :

Ok so here is the root cause:
puppet-pacemaker-0.7.3-0.20190930085227.447cef0.el7.noarch has an older, broken match regex to determine when it is running on rhel8; from https://github.com/openstack/puppet-pacemaker/blob/447cef0ad8891e89495fef7a9be5e2152b07edfa/manifests/params.pp we see it has:
      if $::operatingsystemrelease =~ /8\..*$/ {
        $pcs_010 = true
      } else {
        $pcs_010 = false
      }

That code will match on rhel 7.8, and so we basically kick off the codepaths that are meant for CentOS 8 and pcs 0.10, as opposed to pcs 0.9.x and CentOS 7.
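
To illustrate the false match, here is a minimal stand-alone Puppet sketch (it is not part of puppet-pacemaker; the release string '7.8.2003' is an assumed example of what facter reports for $::operatingsystemrelease on CentOS 7.8). Because =~ does an unanchored search, the pattern finds a match starting at the '8.' substring; an anchored variant is shown only for comparison and is not necessarily the upstream fix:

      # Assumed example value of $::operatingsystemrelease on CentOS 7.8
      $release = '7.8.2003'

      # Broken check: the pattern is unanchored, so it matches the trailing '8.2003'
      if $release =~ /8\..*$/ {
        notice("unanchored pattern matches ${release}: pcs 0.10 codepath wrongly selected")
      } else {
        notice("unanchored pattern does not match ${release}: pcs 0.9.x codepath")
      }

      # Anchored variant for comparison: does not match a 7.x release string
      if $release =~ /^8\..*$/ {
        notice("anchored pattern matches ${release}")
      } else {
        notice("anchored pattern does not match ${release}")
      }

Applying that manifest with 'puppet apply' should log the "wrongly selected" and "does not match" notices, i.e. the pcs 0.10 branch gets taken even though the node runs pcs 0.9.x.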

What I am not 100% sure about is why we did not eventually see an error message in the logs and instead timed out, because once I ran puppet --debug by hand on the env Panda gave me, I saw the errors and it failed after 10 tries.

In any case the fix here is simply to rebase puppet-pacemaker in our releases and be done with it. This has been fixed in master and we do not really use branches for puppet-pacemaker.

I now tested a queens deploy on rhel 7.8 with puppet-pacemaker from master and it worked okay, so I think this should be pretty safe.

I will now test a deploy of queens+7.8 with stock "old" puppet-pacemaker and then redeploy against that with an updated puppet-pacemaker to make sure there are no surprises. If that is okay I will post a review to rdoinfo to move the pins to a puppet-pacemaker that has the operating system fix.

Revision history for this message
Michele Baldessari (michele) wrote :
wes hayutin (weshayutin)
Changed in tripleo:
status: Triaged → Fix Released