Rebuild of volume-backed server times out in gate run

Bug #2046252 reported by Rajat Dhasmana
This bug affects 2 people
Affects: Cinder
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Job: tempest-integrated-compute

Issue: Rebuild test fails

tempest.api.compute.servers.test_server_actions.ServerActionsV293TestJSON.test_rebuild_volume_backed_server

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/api/compute/servers/test_server_actions.py", line 894, in test_rebuild_volume_backed_server
    self.assertEqual(self.image_ref_alt, image_after_rebuild)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.10/site-packages/testtools/testcase.py", line 394, in assertEqual
    self.assertThat(observed, matcher, message)
  File "/opt/stack/tempest/.tox/tempest/lib/python3.10/site-packages/testtools/testcase.py", line 481, in assertThat
    raise mismatch_error
testtools.matchers._impl.MismatchError: !=:
reference = 'dbe01f18-1c90-4536-a09a-b49f0811c7a0'
actual = '5071b70c-80f6-496a-81ac-cf35a71c1fa8'

========================================================

Analysis
--------

NOVA (req-5113eccb-05ba-486a-8130-a58898c8ad35)

Nova waits 60 seconds for the reimage to complete

Dec 05 13:27:17.890624 np0035951188 nova-compute[74920]: DEBUG nova.compute.manager [None req-5113eccb-05ba-486a-8130-a58898c8ad35 tempest-ServerActionsV293TestJSON-1780705112 tempest-ServerActionsV293TestJSON-1780705112-project-member] [instance: d9889f9f-f0b8-4b92-b3d8-7c4879f0b5b4] Preparing to wait for external event volume-reimaged-073a98e8-3c89-4734-9ae5-59af25f8914a {{(pid=74920) prepare_for_instance_event /opt/stack/nova/nova/compute/manager.py:283}}

and after 60 seconds, times out

Dec 05 13:28:18.119292 np0035951188 nova-compute[74920]: WARNING nova.compute.manager [None req-5113eccb-05ba-486a-8130-a58898c8ad35 tempest-ServerActionsV293TestJSON-1780705112 tempest-ServerActionsV293TestJSON-1780705112-project-member] [instance: d9889f9f-f0b8-4b92-b3d8-7c4879f0b5b4] Timeout waiting for ['volume-reimaged-073a98e8-3c89-4734-9ae5-59af25f8914a'] for instance with vm_state active and task_state rebuilding. Event states are: volume-reimaged-073a98e8-3c89-4734-9ae5-59af25f8914a: timed out after 60.00 seconds: eventlet.timeout.Timeout: 60 seconds

Dec 05 13:28:19.411170 np0035951188 nova-compute[74920]: ERROR oslo_messaging.rpc.server [None req-5113eccb-05ba-486a-8130-a58898c8ad35 tempest-ServerActionsV293TestJSON-1780705112 tempest-ServerActionsV293TestJSON-1780705112-project-member] Exception during message handling: ValueError: Circular reference detected

CINDER (req-5113eccb-05ba-486a-8130-a58898c8ad35)

Start time: 13:27:15

Dec 05 13:27:15.391096 np0035951188 cinder-volume[77000]: DEBUG oslo_concurrency.lockutils [req-5113eccb-05ba-486a-8130-a58898c8ad35 req-58918ee8-9e2c-4fc6-a6da-f1acf405ebcd tempest-ServerActionsV293TestJSON-1780705112 None] Acquiring lock "lioadm" by "cinder.volume.targets.lio.LioAdm._execute" {{(pid=77000) inner /opt/stack/data/venv/lib/python3.10/site-packages/oslo_concurrency/lockutils.py:404}}

Failure time: 13:30:00 (the volume had already been deleted because the Nova rebuild operation failed)

Dec 05 13:30:00.365112 np0035951188 cinder-volume[77000]: ERROR oslo_messaging.rpc.server [req-5113eccb-05ba-486a-8130-a58898c8ad35 req-0edf972a-109a-465f-a771-ceb87ecbda3e tempest-ServerActionsV293TestJSON-1780705112 None] Exception during message handling: cinder.exception.VolumeNotFound: Volume 073a98e8-3c89-4734-9ae5-59af25f8914a could not be found.

Total execution time: 165 seconds

Reason for failure:

The Cinder image conversion alone takes 159 seconds, well past Nova's 60-second wait for the volume-reimaged event

Dec 05 13:29:59.709129 np0035951188 cinder-volume[77000]: DEBUG oslo_concurrency.processutils [req-5113eccb-05ba-486a-8130-a58898c8ad35 req-0edf972a-109a-465f-a771-ceb87ecbda3e tempest-ServerActionsV293TestJSON-1780705112 None] CMD "sudo cinder-rootwrap /etc/cinder/rootwrap.conf qemu-img convert -O raw -t none -f qcow2 /opt/stack/data/cinder/conversion/image_download_dbe01f18-1c90-4536-a09a-b49f0811c7a0_copod3cm /dev/mapper/stack--volumes--lvmdriver--1-volume--073a98e8--3c89--4734--9ae5--59af25f8914a" returned: 0 in 159.272s {{(pid=77000) execute /opt/stack/data/venv/lib/python3.10/site-packages/oslo_concurrency/processutils.py:422}}

Resolution: we need to increase the Nova timeout for the reimage operation in our gate jobs
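For deployments (as opposed to gate jobs), the tunable is Nova's reimage_timeout_per_gb option. A sketch of raising it in nova.conf, assuming the option is registered in the [DEFAULT] group as in nova/conf/compute.py:

```ini
[DEFAULT]
# Seconds Nova waits per GB of image size for Cinder's volume-reimaged
# external event during a volume-backed rebuild (default: 20 s/GB).
reimage_timeout_per_gb = 60
```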

Revision history for this message
Gökhan (skylightcoder) wrote :

I am hitting the same issue. timeout waiting for ['volume-reimaged-0ededc0c-4c7d-4f9e-9a5c-599e39652afc'] for instance with vm_state active and task_state rebuilding. Event states are: volume-reimaged-0ededc0c-4c7d-4f9e-9a5c-599e39652afc: timed out after 0.00 seconds: eventlet.timeout.Timeout: 0 seconds
Apr 16 09:05:02 dev-compute2 nova-compute[1450173]: /openstack/venvs/nova-27.3.0/lib/python3.10/site-packages/oslo_serialization/jsonutils.py:180: UserWarning: Cannot convert <Timeout at 0x7c4823a08280 seconds=0> to primitive, will raise ValueError instead of warning in version 3.0. Full logs are in https://paste.openstack.org/show/biPoKAoIZuDRDOTSvM1Y/.

How can we increase the timeout for the reimage operation?

Revision history for this message
Rajat Dhasmana (whoami-rajat) wrote :

Hi Gokhan,

It is really strange to see a timeout of 0 seconds, since the default value is 20 seconds per GB.
The config option is reimage_timeout_per_gb[1], and its value is multiplied by the image size in GB.

Example: if you set 60 seconds per GB and the image size is 10 GB, then the timeout will be 600 seconds (or 10 minutes).

[1] https://opendev.org/openstack/nova/src/branch/master/nova/conf/compute.py#L307-L321
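The arithmetic Rajat describes can be sketched as follows (illustrative names, not Nova's actual internals):

```python
def reimage_timeout(image_size_bytes: int, timeout_per_gb: int = 20) -> int:
    """Event-wait timeout in seconds: timeout_per_gb multiplied by the
    image size in GB (default 20 s/GB, per reimage_timeout_per_gb)."""
    GiB = 1024 ** 3
    size_gb = image_size_bytes / GiB
    return int(timeout_per_gb * size_gb)

# Rajat's example: 60 s/GB with a 10 GB image -> 600 seconds (10 minutes)
print(reimage_timeout(10 * 1024 ** 3, timeout_per_gb=60))  # 600
```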

Revision history for this message
Gökhan (skylightcoder) wrote :

Hi Rajat,
Yes, a timeout of 0 seconds is really weird. I changed reimage_timeout_per_gb = 180 in nova.conf but it didn't work. Maybe eventlet is the problem, I am not sure. My deployment is the Antelope version.

Revision history for this message
Gökhan (skylightcoder) wrote :

Hi Rajat,
I found the issue.
I get this strange 0-second timeout when I try to rebuild a volume-backed server from its snapshot.

The steps I followed are below:
1. Create a snapshot of the volume-backed server.
   This action creates both a volume snapshot and an instance snapshot with an image size of 0.

2. Rebuild the volume-backed server with the 0-size instance snapshot.

If I rebuild the volume-backed server with an image whose size is bigger than 0, it works once I set a suitable timeout.

How can we restore a server from its snapshot?
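The zero-second timeout reported above is consistent with a purely per-GB timeout: if the instance snapshot reports an image size of 0, the computed wait collapses to 0 no matter how high the per-GB value is set. A minimal sketch of that failure mode (illustrative names, not Nova's actual code):

```python
def reimage_timeout(size_gb: int, timeout_per_gb: int = 20) -> int:
    # A per-GB timeout with no lower bound: a 0-size image yields a
    # 0-second wait, so the volume-reimaged event times out immediately.
    return timeout_per_gb * size_gb

print(reimage_timeout(0, timeout_per_gb=180))  # 0
```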

Revision history for this message
Rajat Dhasmana (whoami-rajat) wrote :

Hi Gokhan,

Thanks for testing out this workflow.
Indeed, we have a bug in that code path, which I was able to reproduce, and I will work on fixing it.
In the meantime, can you report a new bug with the details of your testing?

Revision history for this message
Gökhan (skylightcoder) wrote :

Hi Rajat,

Thanks for your interest.
I filed the bug at https://bugs.launchpad.net/cinder/+bug/2062127 .
