test_reboot_server_hard fails with AssertionError: time.struct_time() not greater than time.struct_time()

Bug #2033752 reported by yatin
This bug affects 1 person
Affects                   Status        Importance  Assigned to  Milestone
OpenStack Compute (nova)  Fix Released  Critical    Unassigned
neutron                   Invalid       High        Unassigned
tempest                   Invalid       Undecided   Unassigned

Bug Description

Seen many occurrences recently; the test fails as below:-

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 259, in test_reboot_server_hard
    self._test_reboot_server('HARD')
  File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 127, in _test_reboot_server
    self.assertGreater(new_boot_time, boot_time,
  File "/usr/lib/python3.10/unittest/case.py", line 1244, in assertGreater
    self.fail(self._formatMessage(msg, standardMsg))
  File "/usr/lib/python3.10/unittest/case.py", line 675, in fail
    raise self.failureException(msg)
AssertionError: time.struct_time(tm_year=2023, tm_mon=9, tm_mday=1, tm_hour=7, tm_min=26, tm_sec=33, tm_wday=4, tm_yday=244, tm_isdst=0) not greater than time.struct_time(tm_year=2023, tm_mon=9, tm_mday=1, tm_hour=7, tm_min=26, tm_sec=33, tm_wday=4, tm_yday=244, tm_isdst=0) : time.struct_time(tm_year=2023, tm_mon=9, tm_mday=1, tm_hour=7, tm_min=26, tm_sec=33, tm_wday=4, tm_yday=244, tm_isdst=0) > time.struct_time(tm_year=2023, tm_mon=9, tm_mday=1, tm_hour=7, tm_min=26, tm_sec=33, tm_wday=4, tm_yday=244, tm_isdst=0)

Example logs:-
https://1e11be38b60141dbb290-777f110ca49a5cd01022e1e8aeff1ed5.ssl.cf1.rackcdn.com/893401/5/check/neutron-ovn-tempest-ovs-release/f379752/testr_results.html
https://1b9f88b068db0ff45f98-b11b73e0c31560154dece88f25c72a10.ssl.cf2.rackcdn.com/893401/5/check/neutron-linuxbridge-tempest/0bf1039/testr_results.html
https://30b3c23edbff5d871c4c-595cfa47540877e41ce912cd21563e42.ssl.cf1.rackcdn.com/886988/10/check/neutron-ovs-tempest-multinode-full/e57a62a/testr_results.html
https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_0e5/886988/10/check/neutron-ovn-tempest-ipv6-only-ovs-release/0e538d1/testr_results.html

Opensearch:- https://opensearch.logs.openstack.org/_dashboards/app/discover/?security_tenant=global#/?_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-30d,to:now))&_a=(columns:!(_source),filters:!(),index:'94869730-aea8-11ec-9e6a-83741af3fdcd',interval:auto,query:(language:kuery,query:'message:%22not%20greater%20than%20time.struct_time%22'),sort:!())

As per OpenSearch, this started being seen just a few hours back.

Tags: tempest
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Probably due to the recent merge of https://review.opendev.org/c/openstack/nova/+/882284

Now, when rebooting, we call the Cinder API to check the BDMs, so the reboot could need more time.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

https://github.com/openstack/tempest/blob/28ba6ee268bdf5d18a13a0dc7c7fea2c3a3dfcce/tempest/api/compute/servers/test_server_actions.py#L98-L128

In Tempest, this test is verifying that the reboot time is less than the first boot time.
Given that it can now take more time, I guess that's why we have this issue.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

OK, never mind all my previous comments (I'll hide them): I misunderstood what get_boot_time() [1] was doing. I thought it was calculating the duration of the boot, but it just returns the timestamp at which the server was booted, and [2] simply double-checks that the timestamp has changed because the server was rebooted.

This isn't a performance check at all, so I'm curious why we're getting those failures.

[1] https://github.com/openstack/tempest/blob/28ba6ee268bdf5d18a13a0dc7c7fea2c3a3dfcce/tempest/api/compute/servers/test_server_actions.py#L109
[2] https://github.com/openstack/tempest/blob/28ba6ee268bdf5d18a13a0dc7c7fea2c3a3dfcce/tempest/api/compute/servers/test_server_actions.py#L127-L128
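
For reference, a rough sketch of what that check boils down to (this is not the actual Tempest code; exec_command here stands in for the remote client's SSH helper): the boot time is derived from /proc/uptime with one-second resolution, so a guest that never actually rebooted yields an equal time.struct_time, exactly as in the traceback above.

    import time

    def get_boot_time(exec_command):
        """Boot time of the guest as a time.struct_time (1 s resolution)."""
        # /proc/uptime starts with the number of seconds since the guest booted.
        uptime_secs = int(exec_command('cut -f1 -d. /proc/uptime'))
        return time.localtime(time.time() - uptime_secs)

    # boot_time = get_boot_time(linux_client.exec_command)
    # ... hard reboot and wait for the server to be ACTIVE and SSHable ...
    # new_boot_time = get_boot_time(linux_client.exec_command)
    # assert new_boot_time > boot_time  # equal values mean no actual reboot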

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

OK, I suspect we now SSH into the guest too early to get the updated timestamp. If the reboot process takes a bit longer due to the Cinder check, we would SSH into the guest *before* it has actually rebooted, and accordingly we would read the original boot time (which fails the assertion because the two times are equal).

Fetching the logs to compare the timestamps.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote (last edit ):

No, it isn't even that. Found the reason: we failed to get the volumev3 endpoint URL from the service catalog when rebooting, so we reverted the reboot (and accordingly the instance went back to ACTIVE state without actually having been rebooted):

Sep 01 07:26:06.727199 np0035130323 nova-compute[66352]: DEBUG oslo_concurrency.lockutils [None req-50785304-109b-4a92-bf1a-f2efec407796 tempest-ServerActionsTestJSON-552227195 tempest-ServerActionsTestJSON-552227195-project-member] Acquiring lock "1ae52f43-5c8a-4402-8b53-b97d5e42f9d6" by "nova.compute.manager.ComputeManager.reboot_instance.<locals>.do_reboot_instance" {{(pid=66352) inner /opt/stack/data/venv/lib/python3.10/site-packages/oslo_concurrency/lockutils.py:404}}
Sep 01 07:26:06.727591 np0035130323 nova-compute[66352]: DEBUG oslo_concurrency.lockutils [None req-50785304-109b-4a92-bf1a-f2efec407796 tempest-ServerActionsTestJSON-552227195 tempest-ServerActionsTestJSON-552227195-project-member] Lock "1ae52f43-5c8a-4402-8b53-b97d5e42f9d6" acquired by "nova.compute.manager.ComputeManager.reboot_instance.<locals>.do_reboot_instance" :: waited 0.001s {{(pid=66352) inner /opt/stack/data/venv/lib/python3.10/site-packages/oslo_concurrency/lockutils.py:409}}
Sep 01 07:26:06.727979 np0035130323 nova-compute[66352]: INFO nova.compute.manager [None req-50785304-109b-4a92-bf1a-f2efec407796 tempest-ServerActionsTestJSON-552227195 tempest-ServerActionsTestJSON-552227195-project-member] [instance: 1ae52f43-5c8a-4402-8b53-b97d5e42f9d6] Rebooting instance
Sep 01 07:26:06.740189 np0035130323 nova-compute[66352]: DEBUG oslo_concurrency.lockutils [None req-50785304-109b-4a92-bf1a-f2efec407796 tempest-ServerActionsTestJSON-552227195 tempest-ServerActionsTestJSON-552227195-project-member] Lock "1ae52f43-5c8a-4402-8b53-b97d5e42f9d6" "released" by "nova.compute.manager.ComputeManager.reboot_instance.<locals>.do_reboot_instance" :: held 0.013s {{(pid=66352) inner /opt/stack/data/venv/lib/python3.10/site-packages/oslo_concurrency/lockutils.py:423}}
Sep 01 07:26:06.816412 np0035130323 nova-compute[66352]: DEBUG oslo_concurrency.lockutils [None req-50785304-109b-4a92-bf1a-f2efec407796 tempest-ServerActionsTestJSON-552227195 tempest-ServerActionsTestJSON-552227195-project-member] Acquiring lock "compute_resources" by "nova.compute.resource_tracker.ResourceTracker.update_usage" {{(pid=66352) inner /opt/stack/data/venv/lib/python3.10/site-packages/oslo_concurrency/lockutils.py:404}}
Sep 01 07:26:06.816862 np0035130323 nova-compute[66352]: DEBUG oslo_concurrency.lockutils [None req-50785304-109b-4a92-bf1a-f2efec407796 tempest-ServerActionsTestJSON-552227195 tempest-ServerActionsTestJSON-552227195-project-member] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker.update_usage" :: waited 0.001s {{(pid=66352) inner /opt/stack/data/venv/lib/python3.10/site-packages/oslo_concurrency/lockutils.py:409}}
Sep 01 07:26:07.129230 np0035130323 nova-compute[66352]: DEBUG nova.compute.provider_tree [None req-50785304-109b-4a92-bf1a-f2efec407796 tempest-ServerActionsTestJSON-552227195 tempest-ServerActionsTestJSON-552227195-project-member] Inventory has not changed in ProviderTree for provider: 4920f32e-605a-4c1a...

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

OK, that explains it: we always assume that all devstack jobs deploy Cinder, but that's not the case at all:

"disable_service cinder"
https://1e11be38b60141dbb290-777f110ca49a5cd01022e1e8aeff1ed5.ssl.cf1.rackcdn.com/893401/5/check/neutron-ovn-tempest-ovs-release/f379752/controller/logs/local_conf.txt

So, yeah, this is a Nova bug: we shouldn't unconditionally call Cinder if Cinder isn't deployed. We'll need to handle this somehow in the code.
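
As an illustration only (not necessarily how this will be fixed in Nova), whether Cinder is deployed can be detected from the Keystone service catalog: with "disable_service cinder" there is simply no block-storage endpoint to look up, which is the EndpointNotFound we hit above. A hedged sketch, assuming an already-authenticated keystoneauth1 Session `sess`:

    from keystoneauth1 import exceptions as ks_exc

    def cinder_is_deployed(sess):
        """True if the service catalog advertises a block-storage endpoint."""
        try:
            sess.get_endpoint(service_type='block-storage', interface='public')
            return True
        except ks_exc.EndpointNotFound:
            # No volumev3/block-storage entry in the catalog, i.e. the
            # "disable_service cinder" case from these jobs.
            return False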

Changed in nova:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Setting this to Critical since hard reboot no longer works if Cinder isn't deployed.

Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Okay, in shutdown we already wrap all the potential Cinder exceptions we may get:
https://github.com/openstack/nova/blob/88c73e2931dcb276982612dbbdc3b849d4977c16/nova/compute/manager.py#L3164-L3188

That net is cast a bit wide, and maybe we should focus only on the EndpointNotFound and ClientException exceptions, but I think we should mimic this exception handling in reboot.
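
A hedged sketch of what mimicking that handling in the reboot path could look like; volume_api.attachment_delete() is just a stand-in for whatever Cinder call the BDM cleanup makes, and only the two exception classes mentioned above are caught:

    from cinderclient import exceptions as cinder_exception
    from keystoneauth1 import exceptions as keystone_exception

    def cleanup_dangling_bdm(volume_api, attachment_id, log):
        """Best-effort cleanup that never aborts the reboot."""
        try:
            volume_api.attachment_delete(attachment_id)
        except keystone_exception.EndpointNotFound:
            # Cinder is not deployed at all: nothing to clean up, keep rebooting.
            log.debug('No block-storage endpoint, skipping dangling BDM cleanup')
        except cinder_exception.ClientException as exc:
            # Cinder is deployed but errored out: log it and keep rebooting.
            log.warning('Ignoring Cinder error during reboot cleanup: %s', exc)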

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/893489

Changed in nova:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Amit Uniyal <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/893489

Changed in neutron:
importance: Undecided → High
tags: added: tempest
Changed in neutron:
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/893502
Committed: https://opendev.org/openstack/nova/commit/68b2131d811c935af60eb6f563f4552d7287e3a8
Submitter: "Zuul (22348)"
Branch: master

commit 68b2131d811c935af60eb6f563f4552d7287e3a8
Author: Sean Mooney <email address hidden>
Date: Fri Sep 1 12:46:21 2023 +0100

    only attempt to clean up dangling bdm if cinder is installed

    This change ensures we only try to clean up dangling BDMs if
    cinder is installed and reachable.

    Closes-Bug: #2033752
    Change-Id: I0ada59d8901f8620fd1f3dc20d6be303aa7dabca

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
yatin (yatinkarel) wrote :

Closed for tempest and neutron, as the regression is fixed in nova with https://review.opendev.org/c/openstack/nova/+/893502 and the jobs are back to green.

Changed in neutron:
status: Confirmed → Invalid
Changed in tempest:
status: New → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 28.0.0.0rc1

This issue was fixed in the openstack/nova 28.0.0.0rc1 release candidate.
