OVB jobs are consistently failing on tripleo overcloud node introspect: "Introspection timeout"
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
tripleo |
Fix Released
|
Critical
|
Unassigned |
Bug Description
Some OVB jobs are failing on overcloud node introspect:
TASK [tripleo.
Friday 15 October 2021 08:16:23 +0000 (0:00:00.450) 0:44:53.423 ********
fatal: [undercloud]: FAILED! => {
"changed": true,
"cmd": " openstack overcloud node introspect --all-manageable --provide >/home/
"delta": "1:03:12.304823",
"end": "2021-10-15 09:19:36.242646",
"rc": 1,
"start": "2021-10-15 08:16:23.937823"
}
Introspection completed with errors:
fd4e9887-
51524224-
c776ba5d-
Pooja Jadhav (poojajadhav22) wrote : | #1 |
Douglas Viroel (dviroel) wrote : | #2 |
featureset035 continues to fail on instrospect node task here:
https:/
Build history:
https:/
Harald Jensås (harald-jensas) wrote : | #3 |
Looking at periodic-
https:/
The first attempt to introspect fails, on retry it succedes.
Initial attempt:
DHCP is OK:
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPDISCOVER(
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPOFFER(
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPDISCOVER(
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPOFFER(
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPREQUEST(
Oct 18 20:33:24 dnsmasq-dhcp[7]: DHCPACK(
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPDISCOVER(
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPOFFER(
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPDISCOVER(
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPOFFER(
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPREQUEST(
Oct 18 20:33:28 dnsmasq-dhcp[7]: DHCPACK(
Oct 18 20:33:46 dnsmasq-dhcp[7]: DHCPDISCOVER(
Oct 18 20:33:46 dnsmasq-dhcp[7]: DHCPOFFER(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPDISCOVER(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPOFFER(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPDISCOVER(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPOFFER(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPDISCOVER(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPOFFER(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPREQUEST(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPACK(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPDISCOVER(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPOFFER(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPREQUEST(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPACK(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPREQUEST(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPACK(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPREQUEST(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPACK(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPREQUEST(
Oct 18 20:33:49 dnsmasq-dhcp[7]: DHCPACK(
Oct 18 20:33:51 dnsmasq-dhcp[7]: DHCPDISCOVER(
Oct 18 20:33:51 dnsmasq-dhcp[7]: DHCPOFFER(
Oct 18 20:33:5...
Ronelle Landy (rlandy) wrote : | #4 |
Comparing fs035 running on train vs master:
In master - takes about 5 minutes
2021-10-19 10:12:55.255952 | primary | TASK [tripleo.
2021-10-19 10:12:55.255960 | primary | Tuesday 19 October 2021 10:12:55 +0000 (0:00:00.512) 0:50:21.970 *******
2021-10-19 10:17:51.530372 | primary | changed: [undercloud]
On train - we're at over an hour:
2021-10-19 08:17:10.450161 | primary | TASK [tripleo.
2021-10-19 08:17:10.450249 | primary | Tuesday 19 October 2021 08:17:10 +0000 (0:00:00.518) 0:50:40.110 *******
2021-10-19 09:22:01.509871 | primary | changed: [undercloud]
Ronelle Landy (rlandy) wrote : | #5 |
OVB console logs:
Ronelle Landy (rlandy) wrote : | #6 |
As a comparison: you can see the logs from train when introspection took anywhere from 5-20 minutes (as opposed to over an hour):
Ronelle Landy (rlandy) wrote : | #7 |
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-quickstart-extras (master) | #8 |
Related fix proposed to branch: master
Review: https:/
Alex Schultz (alex-schultz) wrote : | #9 |
The introspection api connection is going away and failing. We retry but it never succeeds. It should be noted even on a passing jobs, we get timeouts waiting for introspections but we don't see this api remote end closed connection.
2021-10-21 08:51:37.409 9 WARNING mistral.
File "/usr/lib/
chunked=
File "/usr/lib/
six.
File "<string>", line 3, in raise_from
File "/usr/lib/
httplib_
File "/usr/lib64/
response.
File "/usr/lib64/
version, status, reason = self._read_status()
File "/usr/lib64/
raise RemoteDisconnec
http.client.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/
timeout=timeout
File "/usr/lib/
method, url, error=e, _pool=self, _stacktrace=
File "/usr/lib/
raise six.reraise(
File "/usr/lib/
raise value.with_
File "/usr/lib/
chunked=
File "/usr/lib/
six.
File "<string>", line 3, in raise_...
Rabi Mishra (rabi) wrote : | #10 |
Seems like node take ages to power on for some reason.
2021-10-19 08:17:26.152 7 INFO ironic_
....
2021-10-19 09:21:23.181 7 INFO ironic_
Ramdisk logs start from Oct 19 09:19:40
Oct 19 09:19:40 localhost.
Harald Jensås (harald-jensas) wrote : | #11 |
I think the ages to boot is not quite accurrate, the node boots around:
2021-10-19 08:17:26.152 7 INFO ironic_
But we have no logs from that boot, we can see DHCP requests indicating and http transfers of inspector.ipxe and agent kernel/ramdisk succeedes. But the node never calls back, to fail/success the inspection. And also does not upload ramdisk logs.
There is a timeout, and inspection restarts - this time the node boots and inspection completes successfully. This time the ramdisk log is uploaded, which explain why we see "Ramdisk logs start from Oct 19 09:19:40"
I've pushed a some patches to try to improve the amount of data we capture in CI for inspection:
https:/
https:/
https:/
That would give us the last 100 KiB of the console.log of each OVB Barmetal insance whenever it is powered off. 100 KiB because that is the hard-coded max console size in Nova: (MAX_CONSOLE_BYTES = 100 * units.Ki)
Since we can see ironic-python-agent is booting based on succesfull DHCP REQ/ACK's. It would be interresting to see what ironic-python-agent logs, why it is not able to callback to report success/failure.
Rabi Mishra (rabi) wrote : | #12 |
> I think the ages to boot is not quite accurrate,
OK. I just noticed the following in conductor logs which made me think that probably some issue with powering on the nodes...
2021-10-19 08:18:11.128 8 DEBUG ironic.common.utils [-] Command stderr is: "" execute /usr/lib/
2021-10-19 08:18:11.129 8 ERROR oslo.service.
2021-10-19 08:18:11.129 8 ERROR oslo.service.
2021-10-19 08:18:11.129 8 ERROR oslo.service.
2021-10-19 08:18:11.129 8 ERROR oslo.service.
2021-10-19 08:18:11.129 8 ERROR oslo.service.
2021-10-19 08:18:11.129 8 ERROR oslo.service.
2021-10-19 08:18:11.129 8 ERROR oslo.service.
2021-10-19 08:18:11.129 8 ERROR oslo.service.
2021-10-19 08:18:11.133 8 ERROR ironic.
....
2021-10-19 08:19:28.677 8 WARNING ironic.
2021-10-19 08:19:28.682 8 DEBUG ironic.
2021-10-19 08:19:28.682 8 WARNING ironic.
2021-10-19 08:19:28.703 8 DEBUG ironic.
2021-10-19 08:19:28.704 8 WARNING ironic.
...
2021-10-19 08:19:28.603 8 DEBUG ironic.common...
Harald Jensås (harald-jensas) wrote : | #13 |
2021-10-19 08:17:28.299 8 DEBUG ironic.
2021-10-19 08:17:32.110 8 DEBUG ironic.
2021-10-19 08:17:41.816 8 DEBUG ironic.
Desired state is rebooting, this should end up calling:
ironic.
The reboot method will call the following methods in the "good path":
ironic.
ironic.
ironic.
The _power_off method is calling:
ironic.
Which returns
ironic.
This is what raises this error:
2021-10-19 08:18:22.683 8 ERROR ironic.
The lock is released:
2021-10-19 08:18:22.734 8 DEBUG ironic.
Result is that the "ironic.
ironic.
ironic.
i.e the node remains powered off, the _power_on step of the reboot is never called.
We have 30 seconds in Train:
# Number of seconds to wait for power operations to complete,
# i.e., so that a baremetal node is in the desired power
# state. If timed out, the power operation is considered a
# failure. (integer value)
# Minimum value: 2
#power_
Upstream the default is 60 seconds, modified in Change-Id: I0aa5131504b60b
Let's try to bump that in CI?
Steve Baker (steve-stevebaker) wrote : | #14 |
It sounds like there is a case for backporting [1] to train, and considering powerpc explicitly sets it to 60s[2] also then we should also add a THT change to add an IronicPowerStat
[1] https:/
[2] https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master) | #15 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master) | #16 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit 65151adc182fbe2
Author: Steve Baker <email address hidden>
Date: Wed Oct 27 12:48:36 2021 +1300
Add parameter IronicPowerStat
A heat parameter ``IronicPowerSt
which sets the number of seconds to wait for power operations to
complete, i.e., so that a baremetal node is in the desired power
state. If timed out, the power operation is considered a failure. The
default is 60 seconds, which is the same as the current Ironic
default.
This could be backported to stable/train, where the ironic default of
30s is causing failures in CI jobs.
Related-Bug: #1947403
Change-Id: I9729beac4c4e84
Douglas Viroel (dviroel) wrote : | #17 |
I started a fix on ci to update power_state_
The patches are here:
https:/
https:/
In a first run we can see jobs is passing, but took more then 1h to finish overcloud node instropection:
"Wednesday 27 October 2021 02:07:29 +0000 (1:03:45.677) 1:48:43.634"
https:/
In the logs we still se one timeout error even setting to 60s:
"Timed out after 60 secs waiting for power off on node 0f3e7bc7-
https:/
With overcloud node introspection taking so long, this job is taking more then 4 hours to finish, when it doesn't times out.
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/wallaby) | #18 |
Related fix proposed to branch: stable/wallaby
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/victoria) | #19 |
Related fix proposed to branch: stable/victoria
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/ussuri) | #20 |
Related fix proposed to branch: stable/ussuri
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (stable/train) | #21 |
Related fix proposed to branch: stable/train
Review: https:/
Harald Jensås (harald-jensas) wrote : | #22 |
Looking at the BMC log[1] the instances took very long to shutdown:
Instances power off time:
cf56643a-
3098f5ef-
d9ececd0-
5196ab60-
62.203463 seconds/avg time to power off
If we compare that to a master job[2]:
32153366-
6301e44a-
a2c99a5b-
36029b10-
94.98839625 seconds/avg time to power off
Now, another difference between train[1] and master[2] is that on master the BMC events occur in a different order.
In the train job we can see it is setting boot device to network, then comes the power off:
[ 3156.139144] openstackbmc[8149]: Set boot device to network
[ 3176.987111] openstackbmc[8192]: Powered off 5196ab60-
In the master job[2] the "Set boot device to network" happens after the nodes has been powered off:
[ 2322.935851] openstackbmc[8281]: Powered off 32153366-
[ 2391.081814] openstackbmc[8281]: Reporting power state "False" for instance 32153366-
[ 2403.399431] openstackbmc[8281]: Set boot device to network
So, why does it work on master despite OVB nodes actually taking even longer to power off?
The workflow is different, the master job uses os_tripleo_
1200 seconds !!!!!! Thats 20 minutes!!
Another interesting observation is that the initial power off in OVB seem to be very slow, but subsequent power offs are much quicker at ~5 seconds!!
INITIAL for node 36029b10-
[ 2326.519714] openstackbmc[8186]: Powered off 36029b10-
[ 2397.319949] openstackbmc[8186]: Reporting power state "False" for instance 36029b10-
SECOND for node 36029b10-
[ 2547.906511] openstackbmc[8186]: Powered off 36029b10-
[ 2553.212485] openstackbmc[8186]: Reporting power state "False" for instance 36029b10-
INITIAL for node a2c99a5b-
[ 2323.131819] openstackbmc[8146]: Powered off a2c99a5b-
[ 2443.568209] openstackbmc[8146]: Reporting power state "False" for instance a2c99a5b-
SECOND for node a2c99a5b-
[ 2603.517865] openstackbmc[8146]: Powered off a2c99a5b-
[ 2609.103883] openstackbmc[8146]: Reporting power state "False" for instance a2c99a5b-
Conclusion:
- bump power_state_
- Since initial/secondary poweroffs have such a big timedi...
Douglas Viroel (dviroel) wrote : | #23 |
Setting the timeout to 180s looks promising, we have two consecutive successful runs:
https:/
https:/
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/victoria) | #24 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/victoria
commit 8ee7df3a3d94529
Author: Steve Baker <email address hidden>
Date: Wed Oct 27 12:48:36 2021 +1300
Add parameter IronicPowerStat
A heat parameter ``IronicPowerSt
which sets the number of seconds to wait for power operations to
complete, i.e., so that a baremetal node is in the desired power
state. If timed out, the power operation is considered a failure. The
default is 60 seconds, which is the same as the current Ironic
default.
This could be backported to stable/train, where the ironic default of
30s is causing failures in CI jobs.
Related-Bug: #1947403
Change-Id: I9729beac4c4e84
(cherry picked from commit 65151adc182fbe2
tags: | added: in-stable-victoria |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/wallaby) | #25 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/wallaby
commit bb6ca08e494d582
Author: Steve Baker <email address hidden>
Date: Wed Oct 27 12:48:36 2021 +1300
Add parameter IronicPowerStat
A heat parameter ``IronicPowerSt
which sets the number of seconds to wait for power operations to
complete, i.e., so that a baremetal node is in the desired power
state. If timed out, the power operation is considered a failure. The
default is 60 seconds, which is the same as the current Ironic
default.
This could be backported to stable/train, where the ironic default of
30s is causing failures in CI jobs.
Related-Bug: #1947403
Change-Id: I9729beac4c4e84
(cherry picked from commit 65151adc182fbe2
tags: | added: in-stable-wallaby |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/ussuri) | #26 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/ussuri
commit 5bb52590f0aea18
Author: Steve Baker <email address hidden>
Date: Wed Oct 27 12:48:36 2021 +1300
Add parameter IronicPowerStat
A heat parameter ``IronicPowerSt
which sets the number of seconds to wait for power operations to
complete, i.e., so that a baremetal node is in the desired power
state. If timed out, the power operation is considered a failure. The
default is 60 seconds, which is the same as the current Ironic
default.
This could be backported to stable/train, where the ironic default of
30s is causing failures in CI jobs.
Related-Bug: #1947403
Change-Id: I9729beac4c4e84
(cherry picked from commit 65151adc182fbe2
tags: | added: in-stable-ussuri |
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-ci (master) | #27 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: master
commit f0dec96d5f0898f
Author: Douglas Viroel <email address hidden>
Date: Tue Oct 26 14:45:49 2021 -0300
Set ironic power state timeout to 180 in ovb jobs
This patch sets undercloud_
180 for all ovb jobs. See LP #1947403 for more info.
Related-Bug: #1947403
Depends-On: https:/
Change-Id: I76057481a53bc1
Signed-off-by: Douglas Viroel <email address hidden>
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (stable/train) | #28 |
Reviewed: https:/
Committed: https:/
Submitter: "Zuul (22348)"
Branch: stable/train
commit 667915a76d2540e
Author: Steve Baker <email address hidden>
Date: Wed Oct 27 12:48:36 2021 +1300
Add parameter IronicPowerStat
A heat parameter ``IronicPowerSt
which sets the number of seconds to wait for power operations to
complete, i.e., so that a baremetal node is in the desired power
state. If timed out, the power operation is considered a failure. The
default is 60 seconds, which is the same as the current Ironic
default.
This could be backported to stable/train, where the ironic default of
30s is causing failures in CI jobs.
Related-Bug: #1947403
Change-Id: I9729beac4c4e84
(cherry picked from commit 65151adc182fbe2
tags: | added: in-stable-train |
Ronelle Landy (rlandy) wrote : | #29 |
fs035 on train is passing now - closing this out
Changed in tripleo: | |
status: | Triaged → Fix Released |
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-quickstart-extras (master) | #30 |
Change abandoned by "Ronelle Landy <email address hidden>" on branch: master
Review: https:/
periodic- tripleo- ci-centos- 8-ovb-1ctlr_ 2comp-featurese t020-train started passing from 3 runs :
https:/ /review. rdoproject. org/zuul/ builds? job_name= periodic- tripleo- ci-centos- 8-ovb-1ctlr_ 2comp-featurese t020-train
https:/ /logserver. rdoproject. org/openstack- periodic- integration- stable4/ opendev. org/openstack/ tripleo- ci/master/ periodic- tripleo- ci-centos- 8-ovb-1ctlr_ 2comp-featurese t020-train/ f8368be/
https:/ /logserver. rdoproject. org/openstack- periodic- integration- stable4/ opendev. org/openstack/ tripleo- ci/master/ periodic- tripleo- ci-centos- 8-ovb-1ctlr_ 2comp-featurese t020-train/ 85f2757/
https:/ /logserver. rdoproject. org/openstack- periodic- integration- stable4/ opendev. org/openstack/ tripleo- ci/master/ periodic- tripleo- ci-centos- 8-ovb-1ctlr_ 2comp-featurese t020-train/ 5d15121/
We should close this bug.