OVB jobs are consistently failing on tripleo overcloud node introspect: "Introspection timeout"

Bug #1947403 reported by Douglas Viroel
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
Critical
Unassigned

Bug Description

Some OVB jobs are failing on overcloud node introspect:

https://logserver.rdoproject.org/openstack-periodic-integration-stable4/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_2comp-featureset020-train/79b2e6c/logs/quickstart_install.log

TASK [tripleo.operator.tripleo_overcloud_node_introspect : Introspect node] ****
Friday 15 October 2021 08:16:23 +0000 (0:00:00.450) 0:44:53.423 ********
fatal: [undercloud]: FAILED! => {
    "changed": true,
    "cmd": " openstack overcloud node introspect --all-manageable --provide >/home/zuul/overcloud_introspect.log 2>&1",
    "delta": "1:03:12.304823",
    "end": "2021-10-15 09:19:36.242646",
    "rc": 1,
    "start": "2021-10-15 08:16:23.937823"
}

https://logserver.rdoproject.org/openstack-periodic-integration-stable4/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-1ctlr_2comp-featureset020-train/79b2e6c/logs/undercloud/home/zuul/overcloud_introspect.log.txt.gz

Introspection completed with errors:
fd4e9887-3001-44c2-acd2-e55e74aac9ca: Introspection timeout
51524224-6c3e-4fa0-9232-9f24b3264093: Introspection timeout
c776ba5d-a343-4223-928d-3877e28936ca: Introspection timeout

Revision history for this message
Pooja Jadhav (poojajadhav22) wrote :
Revision history for this message
Douglas Viroel (dviroel) wrote :
Revision history for this message
Harald Jensås (harald-jensas) wrote :
Download full text (5.4 KiB)

Looking at periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train:
  https://review.rdoproject.org/zuul/build/657be5515b6543488de53e032c03d5aa

The first attempt to introspect fails, on retry it succedes.

Initial attempt:

 DHCP is OK:
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) fa:16:3e:52:8f:ef
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPOFFER(br-ctlplane) 192.168.24.100 fa:16:3e:52:8f:ef
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) fa:16:3e:52:8f:ef
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPOFFER(br-ctlplane) 192.168.24.100 fa:16:3e:52:8f:ef
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPREQUEST(br-ctlplane) 192.168.24.100 fa:16:3e:52:8f:ef
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPACK(br-ctlplane) 192.168.24.100 fa:16:3e:52:8f:ef
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) fa:16:3e:58:d7:73
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPOFFER(br-ctlplane) 192.168.24.101 fa:16:3e:58:d7:73
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) fa:16:3e:58:d7:73
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPOFFER(br-ctlplane) 192.168.24.101 fa:16:3e:58:d7:73
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPREQUEST(br-ctlplane) 192.168.24.101 fa:16:3e:58:d7:73
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPACK(br-ctlplane) 192.168.24.101 fa:16:3e:58:d7:73
Oct 18 20:33:46 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) fa:16:3e:d2:e2:f7
Oct 18 20:33:46 dnsmasq-dhcp[7]: DHCPOFFER(br-ctlplane) 192.168.24.102 fa:16:3e:d2:e2:f7
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) fa:16:3e:41:08:50
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPOFFER(br-ctlplane) 192.168.24.103 fa:16:3e:41:08:50
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) fa:16:3e:d2:e2:f7
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPOFFER(br-ctlplane) 192.168.24.102 fa:16:3e:d2:e2:f7
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) fa:16:3e:41:08:50
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPOFFER(br-ctlplane) 192.168.24.103 fa:16:3e:41:08:50
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPREQUEST(br-ctlplane) 192.168.24.102 fa:16:3e:d2:e2:f7
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPACK(br-ctlplane) 192.168.24.102 fa:16:3e:d2:e2:f7
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) fa:16:3e:41:08:50
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPOFFER(br-ctlplane) 192.168.24.103 fa:16:3e:41:08:50
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPREQUEST(br-ctlplane) 192.168.24.102 fa:16:3e:d2:e2:f7
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPACK(br-ctlplane) 192.168.24.102 fa:16:3e:d2:e2:f7
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPREQUEST(br-ctlplane) 192.168.24.102 fa:16:3e:d2:e2:f7
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPACK(br-ctlplane) 192.168.24.102 fa:16:3e:d2:e2:f7
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPREQUEST(br-ctlplane) 192.168.24.102 fa:16:3e:d2:e2:f7
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPACK(br-ctlplane) 192.168.24.102 fa:16:3e:d2:e2:f7
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPREQUEST(br-ctlplane) 192.168.24.103 fa:16:3e:41:08:50
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPACK(br-ctlplane) 192.168.24.103 fa:16:3e:41:08:50
Oct 18 20:33:51 dnsmasq-dhcp[7]: DHCPDISCOVER(br-ctlplane) fa:16:3e:52:8f:ef
Oct 18 20:33:51 dnsmasq-dhcp[7]: DHCPOFFER(br-ctlplane) 192.168.24.100 fa:16:3e:52:8f:ef
Oct 18 20:33:5...

Read more...

Revision history for this message
Ronelle Landy (rlandy) wrote :

Comparing fs035 running on train vs master:

In master - takes about 5 minutes

 2021-10-19 10:12:55.255952 | primary | TASK [tripleo.operator.tripleo_overcloud_node_introspect : Introspect node] ****
2021-10-19 10:12:55.255960 | primary | Tuesday 19 October 2021 10:12:55 +0000 (0:00:00.512) 0:50:21.970 *******
2021-10-19 10:17:51.530372 | primary | changed: [undercloud]

On train - we're at over an hour:

2021-10-19 08:17:10.450161 | primary | TASK [tripleo.operator.tripleo_overcloud_node_introspect : Introspect node] ****
2021-10-19 08:17:10.450249 | primary | Tuesday 19 October 2021 08:17:10 +0000 (0:00:00.518) 0:50:40.110 *******
2021-10-19 09:22:01.509871 | primary | changed: [undercloud]

Revision history for this message
Ronelle Landy (rlandy) wrote :
Revision history for this message
Ronelle Landy (rlandy) wrote :

As a comparison: you can see the logs from train when introspection took anywhere from 5-20 minutes (as opposed to over an hour):

https://logserver.rdoproject.org/openstack-periodic-integration-stable4/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/cd3d1a4/logs/

Revision history for this message
Ronelle Landy (rlandy) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master)
Revision history for this message
Alex Schultz (alex-schultz) wrote :
Download full text (8.0 KiB)

The introspection api connection is going away and failing. We retry but it never succeeds. It should be noted even on a passing jobs, we get timeouts waiting for introspections but we don't see this api remote end closed connection.

https://logserver.rdoproject.org/64/814464/2/openstack-check/tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset001/157ea66/logs/undercloud/var/log/containers/mistral/executor.log.txt.gz

2021-10-21 08:51:37.409 9 WARNING mistral.actions.openstack.base [req-f69826c4-66cf-4489-afe5-9e986efbe688 d29623a86a3c4983ad8e63b903df24d3 e53fb58fa2c8434c91c8c1646fae0dcc - default default] Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 416, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib64/python3.6/http/client.py", line 1365, in getresponse
    response.begin()
  File "/usr/lib64/python3.6/http/client.py", line 320, in begin
    version, status, reason = self._read_status()
  File "/usr/lib64/python3.6/http/client.py", line 289, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/usr/lib/python3.6/site-packages/urllib3/util/retry.py", line 400, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/lib/python3.6/site-packages/urllib3/packages/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    chunked=chunked,
  File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 421, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_...

Read more...

Revision history for this message
Rabi Mishra (rabi) wrote :

Seems like node take ages to power on for some reason.

2021-10-19 08:17:26.152 7 INFO ironic_inspector.introspect [-] [node: b981374c-3f42-4e99-8f85-c0b3109f4e38 state starting] Introspection started successfully
....
2021-10-19 09:21:23.181 7 INFO ironic_inspector.process [-] [node: b981374c-3f42-4e99-8f85-c0b3109f4e38 state finished MAC fa:16:3e:e2:d4:42] Introspection finished successfully

Ramdisk logs start from Oct 19 09:19:40

Oct 19 09:19:40 localhost.localdomain kernel: Linux version 4.18.0-338.el8.x86_64 (<email address hidden>) (gcc version 8.4.1 20200928 (Red Hat 8.4.1-1) (GCC)) #1 SMP Fri Aug 27 17:32:14 UTC 2021

Revision history for this message
Harald Jensås (harald-jensas) wrote :

I think the ages to boot is not quite accurrate, the node boots around:
2021-10-19 08:17:26.152 7 INFO ironic_inspector.introspect [-] [node: b981374c-3f42-4e99-8f85-c0b3109f4e38 state starting] Introspection started successfully

But we have no logs from that boot, we can see DHCP requests indicating and http transfers of inspector.ipxe and agent kernel/ramdisk succeedes. But the node never calls back, to fail/success the inspection. And also does not upload ramdisk logs.

There is a timeout, and inspection restarts - this time the node boots and inspection completes successfully. This time the ramdisk log is uploaded, which explain why we see "Ramdisk logs start from Oct 19 09:19:40"

I've pushed a some patches to try to improve the amount of data we capture in CI for inspection:

https://review.opendev.org/c/openstack/python-tripleoclient/+/814803
https://review.opendev.org/c/openstack/openstack-virtual-baremetal/+/814800
https://review.rdoproject.org/r/c/config/+/36355

That would give us the last 100 KiB of the console.log of each OVB Barmetal insance whenever it is powered off. 100 KiB because that is the hard-coded max console size in Nova: (MAX_CONSOLE_BYTES = 100 * units.Ki)

Since we can see ironic-python-agent is booting based on succesfull DHCP REQ/ACK's. It would be interresting to see what ironic-python-agent logs, why it is not able to callback to report success/failure.

Revision history for this message
Rabi Mishra (rabi) wrote :
Download full text (3.7 KiB)

> I think the ages to boot is not quite accurrate,

OK. I just noticed the following in conductor logs which made me think that probably some issue with powering on the nodes...

2021-10-19 08:18:11.128 8 DEBUG ironic.common.utils [-] Command stderr is: "" execute /usr/lib/python3.6/site-packages/ironic/common/utils.py:79
2021-10-19 08:18:11.129 8 ERROR oslo.service.loopingcall [-] Dynamic backoff interval looping call 'ironic.conductor.utils.node_wait_for_power_state.<locals>._wait' failed: oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 20.40 seconds
2021-10-19 08:18:11.129 8 ERROR oslo.service.loopingcall Traceback (most recent call last):
2021-10-19 08:18:11.129 8 ERROR oslo.service.loopingcall File "/usr/lib/python3.6/site-packages/oslo_service/loopingcall.py", line 154, in _run_loop
2021-10-19 08:18:11.129 8 ERROR oslo.service.loopingcall idle = idle_for_func(result, self._elapsed(watch))
2021-10-19 08:18:11.129 8 ERROR oslo.service.loopingcall File "/usr/lib/python3.6/site-packages/oslo_service/loopingcall.py", line 351, in _idle_for
2021-10-19 08:18:11.129 8 ERROR oslo.service.loopingcall % self._error_time)
2021-10-19 08:18:11.129 8 ERROR oslo.service.loopingcall oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 20.40 seconds
2021-10-19 08:18:11.129 8 ERROR oslo.service.loopingcall
2021-10-19 08:18:11.133 8 ERROR ironic.conductor.utils [req-06824ad3-b2ed-4557-a5b0-b8bfa0b3b061 4d8927747a344c9590e907366812a57a f0136f62e9b04cc6bee7ed60d56dd2c5 - default default] Timed out after 30 secs waiting for power off on node 102752b7-61e1-4a34-98a6-fe4e38b475ce.: oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 20.40 seconds
....

2021-10-19 08:19:28.677 8 WARNING ironic.conductor.manager [req-0daec60d-e154-48e1-8ca8-4fafcb28c015 - - - - -] During sync_power_state, node b981374c-3f42-4e99-8f85-c0b3109f4e38 state does not match expected state 'power on'. Updating recorded state to 'power off'.
2021-10-19 08:19:28.682 8 DEBUG ironic.conductor.task_manager [req-0daec60d-e154-48e1-8ca8-4fafcb28c015 - - - - -] Node 102752b7-61e1-4a34-98a6-fe4e38b475ce successfully reserved for power state sync (took 0.04 seconds) reserve_node /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:339
2021-10-19 08:19:28.682 8 WARNING ironic.conductor.manager [req-0daec60d-e154-48e1-8ca8-4fafcb28c015 - - - - -] During sync_power_state, node 102752b7-61e1-4a34-98a6-fe4e38b475ce state does not match expected state 'power on'. Updating recorded state to 'power off'.
2021-10-19 08:19:28.703 8 DEBUG ironic.conductor.task_manager [req-0daec60d-e154-48e1-8ca8-4fafcb28c015 - - - - -] Node e2b3a3d2-93aa-43a5-bc10-7c6a3559e765 successfully reserved for power state sync (took 0.03 seconds) reserve_node /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:339
2021-10-19 08:19:28.704 8 WARNING ironic.conductor.manager [req-0daec60d-e154-48e1-8ca8-4fafcb28c015 - - - - -] During sync_power_state, node e2b3a3d2-93aa-43a5-bc10-7c6a3559e765 state does not match expected state 'power on'. Updating recorded state to 'power off'.
...
2021-10-19 08:19:28.603 8 DEBUG ironic.common...

Read more...

Revision history for this message
Harald Jensås (harald-jensas) wrote :

2021-10-19 08:17:28.299 8 DEBUG ironic.conductor.task_manager [req-0daec60d-e154-48e1-8ca8-4fafcb28c015 - - - - -] Attempting to get shared lock on node 677d3491-814a-428e-8be6-f309d6ac95d6 (for power state sync) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:226

2021-10-19 08:17:32.110 8 DEBUG ironic.conductor.task_manager [req-0daec60d-e154-48e1-8ca8-4fafcb28c015 - - - - -] Successfully released shared lock for power state sync on node 677d3491-814a-428e-8be6-f309d6ac95d6 (lock was held 3.81 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:418

2021-10-19 08:17:41.816 8 DEBUG ironic.conductor.manager [req-e6bc6e68-907b-4cbb-96e8-59c96c998143 4d8927747a344c9590e907366812a57a f0136f62e9b04cc6bee7ed60d56dd2c5 - default default] RPC change_node_power_state called for node 677d3491-814a-428e-8be6-f309d6ac95d6. The desired new state is rebooting. change_node_power_state /usr/lib/python3.6/site-packages/ironic/conductor/manager.py:303

Desired state is rebooting, this should end up calling:
  ironic.drivers.modules.ipmitool.IPMIPower.reboot

The reboot method will call the following methods in the "good path":
  ironic.drivers.modules.ipmitool._power_off
  ironic.drivers.utils.ensure_next_boot_device
  ironic.drivers.modules.ipmitool._power_on

The _power_off method is calling:
  ironic.drivers.modules.ipmitool._set_and_wait
Which returns
  ironic.conductor.utils.node_wait_for_power_state

This is what raises this error:
2021-10-19 08:18:22.683 8 ERROR ironic.conductor.utils [req-e6bc6e68-907b-4cbb-96e8-59c96c998143 4d8927747a344c9590e907366812a57a f0136f62e9b04cc6bee7ed60d56dd2c5 - default default] Timed out after 30 secs waiting for power off on node 677d3491-814a-428e-8be6-f309d6ac95d6.: oslo_service.loopingcall.LoopingCallTimeOut: Looping call timed out after 20.60 seconds

The lock is released:
2021-10-19 08:18:22.734 8 DEBUG ironic.conductor.task_manager [req-e6bc6e68-907b-4cbb-96e8-59c96c998143 4d8927747a344c9590e907366812a57a f0136f62e9b04cc6bee7ed60d56dd2c5 - default default] Successfully released exclusive lock for changing node power state on node 677d3491-814a-428e-8be6-f309d6ac95d6 (lock was held 40.87 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:418

Result is that the "ironic.drivers.modules.ipmitool.IPMIPower.reboot" never gets to run:
  ironic.drivers.utils.ensure_next_boot_device
  ironic.drivers.modules.ipmitool._power_on

i.e the node remains powered off, the _power_on step of the reboot is never called.

We have 30 seconds in Train:
# Number of seconds to wait for power operations to complete,
# i.e., so that a baremetal node is in the desired power
# state. If timed out, the power operation is considered a
# failure. (integer value)
# Minimum value: 2
#power_state_change_timeout = 30

Upstream the default is 60 seconds, modified in Change-Id: I0aa5131504b60b13d43c73c9a3be1f50f7855cbc.

Let's try to bump that in CI?

Revision history for this message
Steve Baker (steve-stevebaker) wrote :

It sounds like there is a case for backporting [1] to train, and considering powerpc explicitly sets it to 60s[2] also then we should also add a THT change to add an IronicPowerStateChangeTimeout heat param which defaults to 60

[1] https://review.opendev.org/c/openstack/ironic/+/712592
[2] https://opendev.org/openstack/tripleo-heat-templates/src/branch/master/deployment/ironic/ironic-conductor-container-puppet.yaml#L387

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815570

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815570
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/65151adc182fbe28afc1348a1a2fb664255d5ee1
Submitter: "Zuul (22348)"
Branch: master

commit 65151adc182fbe28afc1348a1a2fb664255d5ee1
Author: Steve Baker <email address hidden>
Date: Wed Oct 27 12:48:36 2021 +1300

    Add parameter IronicPowerStateChangeTimeout

    A heat parameter ``IronicPowerStateChangeTimeout`` has been added
    which sets the number of seconds to wait for power operations to
    complete, i.e., so that a baremetal node is in the desired power
    state. If timed out, the power operation is considered a failure. The
    default is 60 seconds, which is the same as the current Ironic
    default.

    This could be backported to stable/train, where the ironic default of
    30s is causing failures in CI jobs.

    Related-Bug: #1947403
    Change-Id: I9729beac4c4e84d2619c4c629cd26eba8a26b87d

Revision history for this message
Douglas Viroel (dviroel) wrote :

I started a fix on ci to update power_state_change_timeout to 60s and start testing it.
The patches are here:
https://review.opendev.org/c/openstack/tripleo-ci/+/815537
https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/815540

In a first run we can see jobs is passing, but took more then 1h to finish overcloud node instropection:
"Wednesday 27 October 2021 02:07:29 +0000 (1:03:45.677) 1:48:43.634"
https://logserver.rdoproject.org/56/36356/3/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/13d5d98/logs/quickstart_install.log

In the logs we still se one timeout error even setting to 60s:
"Timed out after 60 secs waiting for power off on node 0f3e7bc7-859b-44ec-9c00-903d2f419de9"
https://logserver.rdoproject.org/56/36356/3/check/periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train/13d5d98/logs/undercloud/var/log/extra/errors.txt.gz

With overcloud node introspection taking so long, this job is taking more then 4 hours to finish, when it doesn't times out.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/wallaby)

Related fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815798

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/victoria)

Related fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815799

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815800

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815801

Revision history for this message
Harald Jensås (harald-jensas) wrote :
Download full text (3.9 KiB)

Looking at the BMC log[1] the instances took very long to shutdown:

Instances power off time:
 cf56643a-932d-468a-95f5-e46cacd743a0 :: 71.645662 seconds
 3098f5ef-ceec-4b30-b54e-169a999444f7 :: 65.899861 seconds
 d9ececd0-50c5-4910-8619-df004e7c68c5 :: 64.710398 seconds
 5196ab60-6fb3-433b-a2c9-20ed85578577 :: 46.557931 seconds

62.203463 seconds/avg time to power off

If we compare that to a master job[2]:
 32153366-01d4-44fe-af19-b3c8513241bc :: 68.145963 seconds
 6301e44a-092c-4d90-956f-cbe6da0120da :: 120.570997 seconds
 a2c99a5b-3b8f-427a-bd05-c1c6f7def36f :: 120.436390 seconds
 36029b10-d981-4d02-83fc-8f5f8df8a351 :: 70.800235 seconds

94.98839625 seconds/avg time to power off

Now, another difference between train[1] and master[2] is that on master the BMC events occur in a different order.

In the train job we can see it is setting boot device to network, then comes the power off:

[ 3156.139144] openstackbmc[8149]: Set boot device to network
[ 3176.987111] openstackbmc[8192]: Powered off 5196ab60-6fb3-433b-a2c9-20ed85578577

In the master job[2] the "Set boot device to network" happens after the nodes has been powered off:

[ 2322.935851] openstackbmc[8281]: Powered off 32153366-01d4-44fe-af19-b3c8513241bc
[ 2391.081814] openstackbmc[8281]: Reporting power state "False" for instance 32153366-01d4-44fe-af19-b3c8513241bc
[ 2403.399431] openstackbmc[8281]: Set boot device to network

So, why does it work on master despite OVB nodes actually taking even longer to power off?

The workflow is different, the master job uses os_tripleo_baremetal_node_introspection which power off nodes prior to starting inspection in prepare_for_attempt[3] which has a very long timeout[4].

 1200 seconds !!!!!! Thats 20 minutes!!

Another interesting observation is that the initial power off in OVB seem to be very slow, but subsequent power offs are much quicker at ~5 seconds!!

INITIAL for node 36029b10-d981-4d02-83fc-8f5f8df8a351: (2397.319949-2326.519714 = 70.800235)
[ 2326.519714] openstackbmc[8186]: Powered off 36029b10-d981-4d02-83fc-8f5f8df8a351
[ 2397.319949] openstackbmc[8186]: Reporting power state "False" for instance 36029b10-d981-4d02-83fc-8f5f8df8a351
SECOND for node 36029b10-d981-4d02-83fc-8f5f8df8a351: (2553.212485−2547.906511 = 5.305974)
[ 2547.906511] openstackbmc[8186]: Powered off 36029b10-d981-4d02-83fc-8f5f8df8a351
[ 2553.212485] openstackbmc[8186]: Reporting power state "False" for instance 36029b10-d981-4d02-83fc-8f5f8df8a351

INITIAL for node a2c99a5b-3b8f-427a-bd05-c1c6f7def36f: (2443.568209-2323.131819 = 120.43639)
[ 2323.131819] openstackbmc[8146]: Powered off a2c99a5b-3b8f-427a-bd05-c1c6f7def36f
[ 2443.568209] openstackbmc[8146]: Reporting power state "False" for instance a2c99a5b-3b8f-427a-bd05-c1c6f7def36f
SECOND for node a2c99a5b-3b8f-427a-bd05-c1c6f7def36f: (2609.103883-2603.517865 = 5.586018)
[ 2603.517865] openstackbmc[8146]: Powered off a2c99a5b-3b8f-427a-bd05-c1c6f7def36f
[ 2609.103883] openstackbmc[8146]: Reporting power state "False" for instance a2c99a5b-3b8f-427a-bd05-c1c6f7def36f

Conclusion:
- bump power_state_change_timeout to 180 seconds in train?
- Since initial/secondary poweroffs have such a big timedi...

Read more...

Revision history for this message
Douglas Viroel (dviroel) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815799
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/8ee7df3a3d9452914da86d99aefcfcd1fed56e85
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit 8ee7df3a3d9452914da86d99aefcfcd1fed56e85
Author: Steve Baker <email address hidden>
Date: Wed Oct 27 12:48:36 2021 +1300

    Add parameter IronicPowerStateChangeTimeout

    A heat parameter ``IronicPowerStateChangeTimeout`` has been added
    which sets the number of seconds to wait for power operations to
    complete, i.e., so that a baremetal node is in the desired power
    state. If timed out, the power operation is considered a failure. The
    default is 60 seconds, which is the same as the current Ironic
    default.

    This could be backported to stable/train, where the ironic default of
    30s is causing failures in CI jobs.

    Related-Bug: #1947403
    Change-Id: I9729beac4c4e84d2619c4c629cd26eba8a26b87d
    (cherry picked from commit 65151adc182fbe28afc1348a1a2fb664255d5ee1)

tags: added: in-stable-victoria
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815798
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/bb6ca08e494d5829630f1656be3f9be85e3f1324
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit bb6ca08e494d5829630f1656be3f9be85e3f1324
Author: Steve Baker <email address hidden>
Date: Wed Oct 27 12:48:36 2021 +1300

    Add parameter IronicPowerStateChangeTimeout

    A heat parameter ``IronicPowerStateChangeTimeout`` has been added
    which sets the number of seconds to wait for power operations to
    complete, i.e., so that a baremetal node is in the desired power
    state. If timed out, the power operation is considered a failure. The
    default is 60 seconds, which is the same as the current Ironic
    default.

    This could be backported to stable/train, where the ironic default of
    30s is causing failures in CI jobs.

    Related-Bug: #1947403
    Change-Id: I9729beac4c4e84d2619c4c629cd26eba8a26b87d
    (cherry picked from commit 65151adc182fbe28afc1348a1a2fb664255d5ee1)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815800
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/5bb52590f0aea189967e7434dd0413a885981eb2
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit 5bb52590f0aea189967e7434dd0413a885981eb2
Author: Steve Baker <email address hidden>
Date: Wed Oct 27 12:48:36 2021 +1300

    Add parameter IronicPowerStateChangeTimeout

    A heat parameter ``IronicPowerStateChangeTimeout`` has been added
    which sets the number of seconds to wait for power operations to
    complete, i.e., so that a baremetal node is in the desired power
    state. If timed out, the power operation is considered a failure. The
    default is 60 seconds, which is the same as the current Ironic
    default.

    This could be backported to stable/train, where the ironic default of
    30s is causing failures in CI jobs.

    Related-Bug: #1947403
    Change-Id: I9729beac4c4e84d2619c4c629cd26eba8a26b87d
    (cherry picked from commit 65151adc182fbe28afc1348a1a2fb664255d5ee1)

tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ci (master)

Reviewed: https://review.opendev.org/c/openstack/tripleo-ci/+/815537
Committed: https://opendev.org/openstack/tripleo-ci/commit/f0dec96d5f0898fcc1e6c93a5d11d5706c48d79a
Submitter: "Zuul (22348)"
Branch: master

commit f0dec96d5f0898fcc1e6c93a5d11d5706c48d79a
Author: Douglas Viroel <email address hidden>
Date: Tue Oct 26 14:45:49 2021 -0300

    Set ironic power state timeout to 180 in ovb jobs

    This patch sets undercloud_ironic_power_state_change_timeout to
    180 for all ovb jobs. See LP #1947403 for more info.

    Related-Bug: #1947403
    Depends-On: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/815540
    Change-Id: I76057481a53bc1541ff4abe80ccf5f8361317c4e
    Signed-off-by: Douglas Viroel <email address hidden>

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/815801
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/667915a76d2540eab6f59e68e66ef7ebfa481c06
Submitter: "Zuul (22348)"
Branch: stable/train

commit 667915a76d2540eab6f59e68e66ef7ebfa481c06
Author: Steve Baker <email address hidden>
Date: Wed Oct 27 12:48:36 2021 +1300

    Add parameter IronicPowerStateChangeTimeout

    A heat parameter ``IronicPowerStateChangeTimeout`` has been added
    which sets the number of seconds to wait for power operations to
    complete, i.e., so that a baremetal node is in the desired power
    state. If timed out, the power operation is considered a failure. The
    default is 60 seconds, which is the same as the current Ironic
    default.

    This could be backported to stable/train, where the ironic default of
    30s is causing failures in CI jobs.

    Related-Bug: #1947403
    Change-Id: I9729beac4c4e84d2619c4c629cd26eba8a26b87d
    (cherry picked from commit 65151adc182fbe28afc1348a1a2fb664255d5ee1)

tags: added: in-stable-train
Revision history for this message
Ronelle Landy (rlandy) wrote :
Changed in tripleo:
status: Triaged → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master)

Change abandoned by "Ronelle Landy <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/tripleo-quickstart-extras/+/814641

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.