Wallaby OVB jobs are failing node provision step - ironic conductor: PowerStateFailure: Failed to set node power state to power off

Bug #2034704 reported by Ronelle Landy
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Incomplete
Critical
Unassigned
Ronelle Landy (rlandy)
Changed in tripleo:
milestone: none → antelope-1
importance: Undecided → Critical
status: New → Triaged
tags: added: promotion-blocker
Revision history for this message
Julia Kreger (juliaashleykreger) wrote :

The fundamental issue is the "openstackbmc" is failing to power off the machine, but with the way that is all architected and setup, we have no way of knowing why as far as I can see.

All we have is the console log from the BMC:

https://logserver.rdoproject.org/73/37973/86/check/periodic-tripleo-ci-centos-9-ovb-3ctlr[…]tureset001-wallaby/d478e78/logs/bmc_86_67346-console.log

And if you look for:

[ 3807.890716] openstackbmc[14525]: Powered off 90a12976-17f7-4ded-9f6d-5537c71a694d

You can see the VM power status continues to report that it is on. The ironic conductor service logs mirror this. In other words, it very much doesn't look like service code, but possibly openstackbmc. Openstack-virtual-baremetal last saw changes back in 2022, so could an underlying libvirt change have broken things?

Revision history for this message
Sandeep Yadav (sandeepyadav93) wrote :

>> Openstack-virtual-baremetal last saw changes back in 2022, so could an underlying libvirt change have broken things?

We have been experiencing this intermittent problem with jobs running in both the Vexxhost and PSI clouds. The first failure we noticed occurred on August 28th in the Vexxhost cloud and on August 29th in the PSI internal cloud. Given that the issue started around the same time and the possibility of both PSI and Vexxhost upgrading their infrastructure is low, I wonder if the issue is related to a different layer.

Vexx first failure, 28th

https://logserver.rdoproject.org/openstack-periodic-integration-stable1/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-9-ovb-1ctlr_1comp-featureset002-wallaby/d9150b2/job-output.txt
~~~
2023-08-28 18:52:33.799837 | primary | TASK [tripleo.operator.tripleo_overcloud_node_provision : overcloud node provision] ***
2023-08-28 18:52:33.799887 | primary | Monday 28 August 2023 18:52:33 -0400 (0:00:01.751) 0:50:15.892 *********
2023-08-28 18:52:47.039195 | primary | ASYNC POLL on undercloud: jid=527423317646.197406 started=1 finished=0
.
.
2023-08-28 18:59:23.926089 | primary | fatal: [undercloud]: FAILED! => {"ansible_job_id": "527423317646.197406", "changed": false, "cmd": "source /home/zuul/stackrc; openstack overcloud node provision -o $PROVISION_OUTPUT --stack $PROVISION_STACK /home/zuul/overcloud_baremetal_deploy.yaml >/home/zuul/overcloud_node_provision.log 2>&1", "delta": "0:06:36.565803", "end": "2023-08-28 22:59:12.256865", "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2023-08-28 22:52:35.691062", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
~~~

PSI first failure, 29th

https://sf.hosted.upshift.rdu2.redhat.com/logs/44/49544/24/check-rdo/periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-internal-wallaby/5c13a92/job-output.txt

2023-08-29 19:00:41.515201 | primary | fatal: [undercloud]: FAILED! => {"ansible_job_id": "966085717323.182995", "changed": false, "cmd": "source /home/zuul/stackrc; openstack overcloud node provision -o $PROVISION_OUTPUT --stack $PROVISION_STACK /home/zuul/overcloud_baremetal_deploy.yaml >/home/zuul/overcloud_node_provision.log 2>&1", "delta": "0:08:23.094709", "end": "2023-08-29 23:00:39.179603", "finished": 1, "msg": "non-zero return code", "rc": 1, "start": "2023-08-29 22:52:16.084894", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

Sharing build history for fs001:-

https://sf.hosted.upshift.rdu2.redhat.com/zuul/t/tripleo-ci-internal/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-internal-wallaby
https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-centos-9-ovb-3ctlr_1comp-featureset001-wallaby

Revision history for this message
Julia Kreger (juliaashleykreger) wrote :
Download full text (3.6 KiB)

Here is the question, why does openstackbmc fail to actually power off the machines:

[ 3718.893431] openstackbmc[14225]: Powered off bd155666-3488-42b0-a39e-28f726399af7
[ 3721.038759] openstackbmc[14258]: Reporting power state "True" for instance 29945b59-a1a4-4ea8-b651-89cced710943
[ 3724.459366] openstackbmc[14225]: Reporting power state "True" for instance bd155666-3488-42b0-a39e-28f726399af7
[ 3726.658522] openstackbmc[14258]: Powered off 29945b59-a1a4-4ea8-b651-89cced710943
[ 3729.990707] openstackbmc[14225]: Reporting power state "True" for instance bd155666-3488-42b0-a39e-28f726399af7
[ 3732.823212] openstackbmc[14258]: Reporting power state "True" for instance 29945b59-a1a4-4ea8-b651-89cced710943 <-- Why is the instance *still* running after we told it to power off 11 seconds prior?
[ 3735.625339] openstackbmc[14225]: Reporting power state "True" for instance bd155666-3488-42b0-a39e-28f726399af7 <-- Why is the instance *still* running when we told it to power off 17 seconds prior?
[ 3737.995547] openstackbmc[14258]: Reporting power state "True" for instance 29945b59-a1a4-4ea8-b651-89cced710943
[ 3741.144174] openstackbmc[14225]: Reporting power state "True" for instance bd155666-3488-42b0-a39e-28f726399af7
[ 3743.521301] openstackbmc[14258]: Reporting power state "True" for instance 29945b59-a1a4-4ea8-b651-89cced710943
[ 3747.360741] openstackbmc[14225]: Reporting power state "True" for instance bd155666-3488-42b0-a39e-28f726399af7
[ 3749.052793] openstackbmc[14258]: Reporting power state "True" for instance 29945b59-a1a4-4ea8-b651-89cced710943
[ 3753.872864] openstackbmc[14225]: Reporting power state "True" for instance bd155666-3488-42b0-a39e-28f726399af7
[ 3755.096879] openstackbmc[14258]: Reporting power state "True" for instance 29945b59-a1a4-4ea8-b651-89cced710943
[ 3763.672337] openstackbmc[14225]: Reporting power state "True" for instance bd155666-3488-42b0-a39e-28f726399af7
[ 3765.150681] openstackbmc[14258]: Reporting power state "True" for instance 29945b59-a1a4-4ea8-b651-89cced710943
[ 3777.503347] openstackbmc[14258]: Reporting power state "True" for instance 29945b59-a1a4-4ea8-b651-89cced710943
[ 3779.204387] openstackbmc[14225]: Reporting power state "True" for instance bd155666-3488-42b0-a39e-28f726399af7
[ 3784.885856] openstackbmc[14225]: Reporting power state "False" for instance bd155666-3488-42b0-a39e-28f726399af7 <-- Finally "off" 66 seconds later.

[ 3794.906170] openstackbmc[14258]: Reporting power state "False" for instance 29945b59-a1a4-4ea8-b651-89cced710943 <-- Finally "off", 68 seconds later.

[ 3800.534717] openstackbmc[14258]: Reporting power state "False" for instance 29945b59-a1a4-4ea8-b651-89cced710943
[ 3806.221619] openstackbmc[14258]: Powered on 29945b59-a1a4-4ea8-b651-89cced710943

It is almost as if somewhere along the way "power off" is being translated to "soft power off.

The challenge is openstackbmc calls os-stop, os-stop being an async action with nova.

Per discussion with Sean Mooney, the stop command with nova is a graceful shutdown, which then falls back to a forceful shutdown. https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.graceful_shutdown_timeout

There...

Read more...

Revision history for this message
Alan Pevec (apevec) wrote :

@TheJulia @hjensas is there a recommendation from Ironic to the CI team what logging to add so we capture the timing when these race-conditions happen?

Revision history for this message
Harald Jensås (harald-jensas) wrote :

Closing this issue.

This was intermittent, and we don't see it anymore.

The root cause was that instances took a long time to power off, nova does a graceful shutdown and after a timeout it will force the instance to stop. An issue in the booted image is likely the reason the graceful shutdown took too long/failed. We don't know exactly what change, but a good guess is that some change in upstream CentOS improved the graceful shutdown behavior in out images.

Changed in tripleo:
status: Triaged → Incomplete
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.