Periodic jobs in Queens are failing because of problems in pacemaker
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Critical | Amol Kahat |
Bug Description
Description
===========
I found that corosync is not configured:
- https:/
This leads to another failure: the pcs cluster is not configured.
- https:/
Pacemaker is working but not configured correctly.
From pcs.txt:
+ pcs status
Error: cluster is not currently running on this node
+ pcs config
Error: unable to get crm_config
Signon to CIB failed: Transport endpoint is not connected
Init failed, could not perform requested operations
Error: error running crm_mon, is pacemaker running?
Cluster Name:
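Since pcs itself errors out, this state can be cross-checked directly on the node. A minimal sketch, assuming the standard el7 paths (not taken from the job logs):

$ ls -l /etc/corosync/corosync.conf      # absent if corosync was never configured
$ systemctl is-active corosync pacemaker pcsd
$ ls -l /var/lib/pacemaker/cib/cib.xml   # without a CIB, "Signon to CIB failed" is expected

The empty "Cluster Name:" above fits the same picture: there is no cluster configuration for pcs to report.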
Changed in tripleo:
status: Triaged → Fix Released
So we see that:

1) pcsd gets started by puppet running on the host:

Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Kernel/Exec[rebuild initramfs]) Triggered 'refresh' from 37 events
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Kernel/Sysctl::Value[net.nf_conntrack_max]/Sysctl_runtime[net.nf_conntrack_max]/val) val changed '262144' to '500000'
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Pacemaker/Systemd::Unit_file[docker.service]/File[/etc/systemd/system/resource-agents-deps.target.wants/docker.service]/ensure) created
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Tripleo::Profile::Base::Pacemaker/Systemd::Unit_file[rhel-push-plugin.service]/File[/etc/systemd/system/resource-agents-deps.target.wants/rhel-push-plugin.service]/ensure) created
Apr 29 09:52:41 node-0000072019 usermod[23255]: change user 'hacluster' password
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Pacemaker::Corosync/User[hacluster]/password) changed password
Apr 29 09:52:41 node-0000072019 usermod[23262]: add 'hacluster' to group 'haclient'
Apr 29 09:52:41 node-0000072019 usermod[23262]: add 'hacluster' to shadow group 'haclient'
Apr 29 09:52:41 node-0000072019 puppet-user[16055]: (/Stage[main]/Pacemaker::Corosync/User[hacluster]/groups) groups changed '' to ['haclient']
Apr 29 09:52:42 node-0000072019 systemd[1]: Reloading.
Apr 29 09:52:42 node-0000072019 systemd[1]: Starting PCS GUI and remote configuration interface...
Apr 29 09:52:42 node-0000072019 systemd[1]: Started PCS GUI and remote configuration interface.
Apr 29 09:52:43 node-0000072019 puppet-user[16055]: (/Stage[main]/Pacemaker::Service/Service[pcsd]/ensure) ensure changed 'stopped' to 'running'
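For reference, the entries above can be recovered from the node's syslog; assuming the job archives the usual el7 /var/log/messages, a grep along these lines is enough to reconstruct the sequence:

$ grep -E 'puppet-user|usermod|pcsd' /var/log/messages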
2) After the pcsd service started, nothing else seems to be observed in the logs at all.

3) What we do know is that puppet was still running, because in pstree we have:
| |-sshd(8926)---sshd(8929)---sh(16041)---sudo(16050)---sh(16052)---python(16053)---python(16054)---puppet(160+
So likely puppet is hanging *somewhere*.
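If a node in this state is still reachable, attaching to the stuck process is the quickest way to see where it hangs. A sketch, using the PIDs from the pstree above:

$ strace -fp 16055 -e trace=read,write,wait4   # which syscall is it blocked in?
$ cat /proc/16055/wchan; echo                  # kernel function it is sleeping in
$ ls -l /proc/16055/fd                         # pipes/sockets it may be waiting on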
From the pcsd logs we see the following last messages:

I, [2020-04-29T10:52:44.742198 #23294]  INFO -- : Running: /usr/sbin/pcs status nodes corosync
I, [2020-04-29T10:52:44.742248 #23294]  INFO -- : CIB USER: hacluster, groups:
I, [2020-04-29T10:52:45.097004 #23294]  INFO -- : Return Value: 1
I, [2020-04-29T10:52:45.097178 #23294]  INFO -- : Config files sync skipped, this host does not seem to be in a cluster of at least 2 nodes
I.e. after the pcsd service started, it seems unlikely that any new pcs commands were run at all, because pcsd normally logs most of the commands that are run via pcs.
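On el7 the pcsd request log lives in /var/log/pcsd/pcsd.log, so the absence of further "Running:" entries there is a reasonable proxy for "no new pcs commands were issued":

$ grep 'Running:' /var/log/pcsd/pcsd.log | tail -n 5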
So puppet is busy doing *something*, although even in the top.log I cannot see any interesting puppet child processes that could be hanging:
16053 root 20 0 88380 8180 3664 S 0.0 0.1 0:00.04 python
16054 root 20 0 103584 11128 3896 S 0.0 0.1 0:00.74 python
16055 root 20 0 410488 107196 14864 S 0.0 1.3 0:09.10 puppet
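To rule out a wedged child more directly, one could also dump the state and kernel wait channel of everything under the puppet process:

$ ps --ppid 16055 -o pid,stat,wchan:20,cmd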
In terms of packages we have:
pacemaker-1.1.21-4.el7.x86_...
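The full set installed on the node can be confirmed with a plain rpm query (the pacemaker version above is the only one quoted in the report):

$ rpm -qa 'pacemaker*' 'corosync*' 'pcs*'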