overcloud deploy failed due to Systemd start for pcsd failed

Bug #1867602 reported by chandan kumar
This bug affects 1 person
Affects: tripleo
Importance: Medium
Assigned to: Unassigned

Bug Description

On CentOS-8, the fs035 and fs020 OVB multinode jobs are failing during overcloud deploy with the following errors:

https://logserver.rdoproject.org/09/25909/1/check/periodic-tripleo-ci-centos-8-ovb-1ctlr_2comp-featureset020-master/bc847d4/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz and

https://logserver.rdoproject.org/09/25909/1/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-master/6777eaf/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2020-03-16 08:37:28 | <13>Mar 16 08:37:25 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/User[hacluster]/groups: groups changed to ['haclient']
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: Error: Systemd start for pcsd failed!
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: journalctl log for pcsd:
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: -- Logs begin at Mon 2020-03-16 08:26:33 UTC, end at Mon 2020-03-16 08:37:26 UTC. --
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: Mar 16 08:37:25 overcloud-controller-0 systemd[1]: Starting PCS GUI and remote configuration interface...
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: Mar 16 08:37:26 overcloud-controller-0 systemd[1]: pcsd.service: Main process exited, code=exited, status=1/FAILURE
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: Mar 16 08:37:26 overcloud-controller-0 systemd[1]: pcsd.service: Failed with result 'exit-code'.
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: Mar 16 08:37:26 overcloud-controller-0 systemd[1]: Failed to start PCS GUI and remote configuration interface.
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: Error: /Stage[main]/Pacemaker::Service/Service[pcsd]/ensure: change from 'stopped' to 'running' failed: Systemd start for pcsd failed!
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: journalctl log for pcsd:
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: -- Logs begin at Mon 2020-03-16 08:26:33 UTC, end at Mon 2020-03-16 08:37:26 UTC. --
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: Mar 16 08:37:25 overcloud-controller-0 systemd[1]: Starting PCS GUI and remote configuration interface...
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: Mar 16 08:37:26 overcloud-controller-0 systemd[1]: pcsd.service: Main process exited, code=exited, status=1/FAILURE
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: Mar 16 08:37:26 overcloud-controller-0 systemd[1]: pcsd.service: Failed with result 'exit-code'.
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: Mar 16 08:37:26 overcloud-controller-0 systemd[1]: Failed to start PCS GUI and remote configuration interface.
2020-03-16 08:37:28 | <13>Mar 16 08:37:26 puppet-user: Notice: /Stage[main]/Pacemaker::Service/Service[pcsd]: Triggered 'refresh' from 1 event

I did not find anything in the journalctl logs: https://logserver.rdoproject.org/09/25909/1/check/periodic-tripleo-ci-centos-8-ovb-1ctlr_2comp-featureset020-master/bc847d4/logs/overcloud-controller-0/var/log/journal.txt.gz

Based on discussion with bandini on irc, it might be this:
<bandini> yeah usually it is a dns failure, lemme check
<bandini> mmh this is different: E, [2020-03-16T02:34:50.204 #00000] ERROR -- : Unable to start pcsd daemon, exiting: [Errno 2] No such file or directory: '/var/lib/pcsd/pcsd.crt'
<bandini> never seen that one
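
When triaging similar failures, it can help to pull the missing path straight out of the pcsd error text. The following is only a minimal sketch, assuming the error keeps the "[Errno 2] No such file or directory: '...'" wording quoted above; the function name `missing_paths` is hypothetical and not part of pcs or TripleO.

```python
import re

# Assumption: pcsd reports a missing file with Python's usual ENOENT wording,
# as in the error quoted in this bug.
_ENOENT = re.compile(r"\[Errno 2\] No such file or directory: '([^']+)'")


def missing_paths(log_lines):
    """Return the distinct paths reported as missing, in first-seen order."""
    seen = []
    for line in log_lines:
        match = _ENOENT.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen
```

Feeding it the lines of a journal or pcsd log gives a short list of paths to look for on the node or in the image.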

summary: - overcloud deploy failed with Systemd start for pcsd failed
+ overcloud deploy failed due to Systemd start for pcsd failed
Michele Baldessari (michele) wrote :

Tested a downstream deploy and pcsd starts up just fine:
[root@controller-0 ~]# rpm -q pcs
pcs-0.10.2-4.el8.x86_64
[root@controller-0 ~]# rpm -q puppet-pacemaker puppet-tripleo
puppet-pacemaker-0.8.1-0.20200312111720.e1f66f2.el8ost.noarch
puppet-tripleo-11.4.1-0.20200312025814.116e8e8.el8ost.noarch

Michele Baldessari (michele) wrote :

Can we add the following hiera key 'pacemaker::corosync::pcsd_debug: true' to this promotion job?

wes hayutin (weshayutin)
Changed in tripleo:
status: Confirmed → Triaged
wes hayutin (weshayutin) wrote :

moving to incomplete, can't repro

Changed in tripleo:
status: Triaged → Incomplete
Michele Baldessari (michele) wrote :

So this seems to have resurfaced somehow. Yatin reproduced it with pcsd_debug set to true here:
https://logserver.rdoproject.org/19/713219/3/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001/1b7bf6b/logs/

Interestingly, setting pcsd_debug to true gave us no additional information, so we will need to walk through the pcs code to understand what is going wrong here.

The normal pcsd start operation does create those files, so I am not sure what is going on here.

Changed in tripleo:
status: Incomplete → Triaged
Rabi Mishra (rabi) wrote :

Not sure if it's related, but I see:

"2020-04-09 11:19:15 | Ignoring -days; not generating a certificate"

This differs from earlier passing jobs. Maybe the certs are getting corrupted by the change in https://review.opendev.org/#/c/717953/

https://logserver.rdoproject.org/53/717953/4/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001/7ea401f/logs/undercloud/home/zuul/overcloud_create_ssl_cert.log.txt.gz

Michele Baldessari (michele) wrote :

So the root cause seems to be a bad image and is likely the same root cause for https://bugs.launchpad.net/tripleo/+bug/1871703

I was looking at an image from 1871703 in Wes' environment and saw that /var/lib/pcsd is missing, so that is likely the root cause of this problem.

wes hayutin (weshayutin) wrote :

Chandan could not repro

Changed in tripleo:
status: Triaged → Incomplete
Rabi Mishra (rabi) wrote :

> Chandan could not repro

@wes, What does this mean? Are you saying there is no issue with the image as mentioned in comment #6?

We've noticed these failures today with the ovb jobs.

wes hayutin (weshayutin)
Changed in tripleo:
status: Incomplete → Triaged
wes hayutin (weshayutin) wrote :
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
Michele Baldessari (michele) wrote :

So the error is a bit different this time but it is still an image problem:
Apr 13 11:53:15 overcloud-controller-0 pcsd[35068]: FileNotFoundError: [Errno 2] No such file or directory: '/var/log/pcsd/pcsd.log'

So to me this implies that /var/log/pcsd has been removed, since it normally is part of the pcs rpm:
[root@controller-0 ~]# rpm -ql pcs |grep "var/log"
/var/log/pcsd
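
A quick way to confirm whether an image is affected is to check for the two directories named in this bug. The following is a minimal sketch, not an official check: the helper name `missing_pcsd_dirs` and the idea of pointing it at a mounted image root are assumptions, and the pcs package may own other paths that are not covered here.

```python
from pathlib import Path

# Assumption: these are only the two directories reported missing in this bug
# (/var/lib/pcsd and /var/log/pcsd); the pcs rpm may ship more.
REQUIRED_DIRS = ("var/lib/pcsd", "var/log/pcsd")


def missing_pcsd_dirs(image_root):
    """Return the required pcsd directories that are absent under image_root."""
    root = Path(image_root)
    return [f"/{rel}" for rel in REQUIRED_DIRS if not (root / rel).is_dir()]
```

For example, `missing_pcsd_dirs("/")` on a running controller, or `missing_pcsd_dirs("/mnt/image")` with a hypothetical mount point for the overcloud image, returns an empty list when both directories are present.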

wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Rafael Folco (rafaelfolco) wrote :

Seen this on https://logserver.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_2comp-featureset020-master/7f5fe14/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: Error: Systemd start for pcsd failed!
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: journalctl log for pcsd:
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: -- Logs begin at Sun 2020-05-31 18:53:38 UTC, end at Sun 2020-05-31 19:08:47 UTC. --
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: May 31 19:08:46 overcloud-controller-0 systemd[1]: Starting PCS GUI and remote configuration interface...
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: May 31 19:08:47 overcloud-controller-0 systemd[1]: pcsd.service: Main process exited, code=exited, status=1/FAILURE
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: May 31 19:08:47 overcloud-controller-0 systemd[1]: pcsd.service: Failed with result 'exit-code'.
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: May 31 19:08:47 overcloud-controller-0 systemd[1]: Failed to start PCS GUI and remote configuration interface.
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: Error: /Stage[main]/Pacemaker::Service/Service[pcsd]/ensure: change from 'stopped' to 'running' failed: Systemd start for pcsd failed!
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: journalctl log for pcsd:
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: -- Logs begin at Sun 2020-05-31 18:53:38 UTC, end at Sun 2020-05-31 19:08:47 UTC. --
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: May 31 19:08:46 overcloud-controller-0 systemd[1]: Starting PCS GUI and remote configuration interface...
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: May 31 19:08:47 overcloud-controller-0 systemd[1]: pcsd.service: Main process exited, code=exited, status=1/FAILURE
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: May 31 19:08:47 overcloud-controller-0 systemd[1]: pcsd.service: Failed with result 'exit-code'.
2020-05-31 19:08:50 | <13>May 31 19:08:47 puppet-user: May 31 19:08:47 overcloud-controller-0 systemd[1]: Failed to start PCS GUI and remote configuration interface.

wes hayutin (weshayutin) wrote :

This is a CI or DIB (diskimage-builder) issue.

Changed in tripleo:
importance: Critical → Medium
tags: removed: promotion-blocker
wes hayutin (weshayutin) wrote :

Same issue as https://bugs.launchpad.net/tripleo/+bug/1879766

Missing files.

Fix the image build and check please :)
