nova-lvm job failing for volume detach timeout

Bug #1998148 reported by Ghanshyam Mann
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: High
Assigned to: Jorge San Emeterio

Bug Description

The nova-lvm job started failing 100% of the time on a few volume detach tests. It started failing when we moved the job to run on Ubuntu Jammy (22.04).

The following tests are failing:

tearDownClass (tempest.api.compute.servers.test_server_rescue.ServerStableDeviceRescueTest)

tempest.api.compute.servers.test_server_rescue.ServerStableDeviceRescueTest.test_stable_device_rescue_disk_virtio_with_volume_attached[id-a3772b42-00bf-4310-a90b-1cc6fd3e7eab,volume]

tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume[id-f56e465b-fe10-48bf-b75d-646cda3a8bc9,negative,volume]

tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_attach_detach_volume[id-52e9045a-e90d-4c0d-9087-79d657faffff]

traceback-1: {{{
Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/common/waiters.py", line 405, in wait_for_volume_attachment_remove_from_server
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: Volume 8bb34af9-d48d-4ee4-b534-2356b08e3a65 failed to detach from server c4cab756-e4b2-44c9-9a50-6115608cf467 within the required time (196 s) from the compute API perspective
}}}

traceback-2: {{{
Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/common/waiters.py", line 337, in wait_for_volume_resource_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: volume 8bb34af9-d48d-4ee4-b534-2356b08e3a65 failed to reach available status (current in-use) within the required time (196 s).
}}}

Traceback (most recent call last):
  File "/opt/stack/tempest/tempest/api/compute/volumes/test_attach_volume.py", line 115, in test_attach_detach_volume
    waiters.wait_for_volume_resource_status(
  File "/opt/stack/tempest/tempest/common/waiters.py", line 337, in wait_for_volume_resource_status
    raise lib_exc.TimeoutException(message)
tempest.lib.exceptions.TimeoutException: Request timed out
Details: volume 8bb34af9-d48d-4ee4-b534-2356b08e3a65 failed to reach available status (current in-use) within the required time (196 s).

https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_f07/865658/5/check/nova-lvm/f076244/testr_results.html

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/865922

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/865922
Committed: https://opendev.org/openstack/nova/commit/2aa779016d79b97f61677293d779fd25d9b37e9f
Submitter: "Zuul (22348)"
Branch: master

commit 2aa779016d79b97f61677293d779fd25d9b37e9f
Author: Ghanshyam Mann <email address hidden>
Date: Mon Nov 28 17:54:45 2022 -0600

    Temporary skip some volume detach test in nova-lvm job

    A few tests related to volume detach are timing out in the
    nova-lvm job (failing 100%[1]). The root cause of the timeout is
    not known and it may take time to find and fix the issue. To
    unblock the gate and keep running the rest of the tests in the lvm
    job, let's skip the failing tests until they are fixed.

    Related-Bug: #1998148

    [1] https://zuul.opendev.org/t/openstack/builds?job_name=nova-lvm&branch=master&skip=0

    Change-Id: Id29ce352df84168d0a45512e2c59820aefc75943
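
For reference, skips like this are typically expressed on the job definition in nova's .zuul.yaml via the tempest_exclude_regex variable understood by the devstack/tempest-based jobs. The snippet below is only an illustrative sketch assembled from the test names listed in this bug, not the merged change itself:

{{{
# .zuul.yaml sketch (illustrative, not the merged patch): add an
# exclude regex to the nova-lvm job so the failing detach tests are
# skipped until the root cause is fixed.
- job:
    name: nova-lvm
    vars:
      tempest_exclude_regex: (test_stable_device_rescue_disk_virtio_with_volume_attached|test_rescued_vm_detach_volume|test_attach_detach_volume)
}}}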

Elod Illes (elod-illes)
Changed in nova:
status: New → Confirmed
Elod Illes (elod-illes)
Changed in nova:
importance: Undecided → High
Changed in nova:
assignee: nobody → Jorge San Emeterio (jsanemet)
status: Confirmed → In Progress
Revision history for this message
Jorge San Emeterio (jsanemet) wrote :

I have been digging into this bug for some time now and it is a tricky one. I have not been able to determine exactly why it happens, but there are a few pointers I have figured out:

*.- All of the below was tested on Ubuntu 22.04.2 LTS.

1.- It must be something related to devstack's configuration. Deploying devstack with a default local.conf does not reproduce the bug; a devstack with a local.conf equal or very similar to the one deployed by zuul is required for this to happen (see the local.conf sketch at the end of this comment).

2.- It seems like the image used in the test matters. Running the test with cirros 0.5.2 always results in the timeout. However, I have also tried it with Alpine instead and, with that, the timeout happened only once every three runs or so.

3.- In any case, the timeout happens because the request to detach the volume from the server is issued to qemu, and qemu takes a very long time to perform the unplug, much longer than Nova is willing to wait. Eventually, if you let it run long enough, the volume does get detached.

4.- I have not found any errors in libvirt or qemu. The request is issued and the operation proceeds as it should, although it takes a long time.

5.- Even though nova eventually gives up waiting and goes back to considering the volume 'in-use', the operation is still ongoing in the background and can result in a mismatch between libvirt and nova.

Maybe someone with more insight into the inner workings of libvirt could lend a hand to determine why this slowdown occurs.
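
For anyone trying to reproduce this locally, here is a minimal local.conf sketch approximating the LVM image backend the job uses. The NOVA_BACKEND variable and the volume group name are assumptions about the job's configuration; images_type and images_volume_group are the underlying nova [libvirt] options:

{{{
# local.conf sketch (assumptions noted above), approximating nova-lvm
[[local|localrc]]
# devstack knob believed to be set by the nova-lvm zuul job
NOVA_BACKEND=LVM

[[post-config|$NOVA_CONF]]
[libvirt]
# store instance disks as LVM logical volumes instead of qcow2 files
images_type = lvm
# volume group name here is illustrative only
images_volume_group = stack-volumes-default
}}}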

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/881389

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

So I re-enabled the test cases in the lvm job, bumped the detach timeout to 60 seconds from the default 20 seconds, and the test still fails. The nova-compute log shows the same symptom: the first detach request to qemu never succeeds and later retries are rejected by qemu:

Apr 24 16:31:47.197069 np0033831695 nova-compute[81721]: WARNING nova.virt.libvirt.driver [None req-49426422-4e77-407c-8207-8cee80cec58d tempest-AttachVolumeTestJSON-1677813526 tempest-AttachVolumeTestJSON-1677813526-project-member] Waiting for libvirt event about the detach of device vdb with device alias virtio-disk1 from instance e3a5dcba-2acf-4f71-96c9-fd723d438f56 is timed out.
Apr 24 16:31:47.199043 np0033831695 nova-compute[81721]: DEBUG nova.virt.libvirt.driver [None req-49426422-4e77-407c-8207-8cee80cec58d tempest-AttachVolumeTestJSON-1677813526 tempest-AttachVolumeTestJSON-1677813526-project-member] Failed to detach device vdb with device alias virtio-disk1 from instance e3a5dcba-2acf-4f71-96c9-fd723d438f56 from the live domain config. Libvirt did not report any error but the device is still in the config. {{(pid=81721) _detach_from_live_with_retry /opt/stack/nova/nova/virt/libvirt/driver.py:2512}}
Apr 24 16:31:47.199290 np0033831695 nova-compute[81721]: DEBUG nova.virt.libvirt.driver [None req-49426422-4e77-407c-8207-8cee80cec58d tempest-AttachVolumeTestJSON-1677813526 tempest-AttachVolumeTestJSON-1677813526-project-member] (2/8): Attempting to detach device vdb with device alias virtio-disk1 from instance e3a5dcba-2acf-4f71-96c9-fd723d438f56 from the live domain config. {{(pid=81721) _detach_from_live_with_retry /opt/stack/nova/nova/virt/libvirt/driver.py:2494}}
Apr 24 16:31:47.199707 np0033831695 nova-compute[81721]: DEBUG nova.virt.libvirt.guest [None req-49426422-4e77-407c-8207-8cee80cec58d tempest-AttachVolumeTestJSON-1677813526 tempest-AttachVolumeTestJSON-1677813526-project-member] detach device xml: <disk type="block" device="disk">
Apr 24 16:31:47.199707 np0033831695 nova-compute[81721]: <driver name="qemu" type="raw" cache="none" io="native"/>
Apr 24 16:31:47.199707 np0033831695 nova-compute[81721]: <source dev="/dev/sdb"/>
Apr 24 16:31:47.199707 np0033831695 nova-compute[81721]: <target dev="vdb" bus="virtio"/>
Apr 24 16:31:47.199707 np0033831695 nova-compute[81721]: <serial>95041152-35bf-4352-9228-9acdac3aa034</serial>
Apr 24 16:31:47.199707 np0033831695 nova-compute[81721]: <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x0"/>
Apr 24 16:31:47.199707 np0033831695 nova-compute[81721]: </disk>
Apr 24 16:31:47.199707 np0033831695 nova-compute[81721]: {{(pid=81721) detach_device /opt/stack/nova/nova/virt/libvirt/guest.py:465}}
Apr 24 16:31:47.202783 np0033831695 nova-compute[81721]: DEBUG nova.virt.libvirt.driver [None req-49426422-4e77-407c-8207-8cee80cec58d tempest-AttachVolumeTestJSON-1677813526 tempest-AttachVolumeTestJSON-1677813526-project-member] Libvirt returned error while detaching device vdb from instance e3a5dcba-2acf-4f71-96c9-fd723d438f56. Libvirt error code: 1, error message: internal error: unable to execute QEMU command 'device_del': Device virtio-disk1 is already in the process of unplug. {{(pid=81721) _...

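The retry behaviour visible above ("(2/8): Attempting to detach...") is driven by nova's device detach options in the [libvirt] section. A sketch of the override gibi describes, with values taken from this comment (the defaults are 20 seconds and 8 attempts):

{{{
# nova.conf sketch: bump the per-attempt detach wait as described above
[libvirt]
# seconds nova waits for the libvirt device-removed event per attempt
device_detach_timeout = 60
# number of detach attempts before giving up (default shown)
device_detach_attempts = 8
}}}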

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote (last edit ):

The test case test_attach_detach_volume_to_instance clearly does not wait for the server to be sshable before it does the volume attach and detach: https://github.com/openstack/tempest/blob/e64f4c2b07e048ea77d08a29b7afbd17448abc32/tempest/api/volume/test_volumes_actions.py#L42-L55
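
For context, a test can avoid this by waiting for the guest to be reachable before exercising attach/detach. Below is a minimal sketch of what that looks like inside a tempest compute API test method, assuming create_test_server's wait_until='SSHABLE' support and the usual validation resource helpers (names and exact signatures here are from memory, not the actual fix):

{{{
# Sketch (illustrative): boot the server and wait until it is SSHABLE
# before attaching/detaching, so the guest kernel is fully up and can
# acknowledge the virtio unplug request.
from tempest.common import waiters

validation_resources = self.get_test_validation_resources(self.os_primary)
server = self.create_test_server(
    validatable=True,
    validation_resources=validation_resources,
    wait_until='SSHABLE')
volume = self.create_volume()

self.servers_client.attach_volume(server['id'], volumeId=volume['id'])
waiters.wait_for_volume_resource_status(
    self.volumes_client, volume['id'], 'in-use')

self.servers_client.detach_volume(server['id'], volume['id'])
waiters.wait_for_volume_resource_status(
    self.volumes_client, volume['id'], 'available')
}}}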

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Even though the test case test_rescued_vm_detach_volume seems to have an sshable waiter
https://github.com/openstack/tempest/blob/e64f4c2b07e048ea77d08a29b7afbd17448abc32/tempest/api/compute/servers/test_server_rescue_negative.py#L138-L155

looking at the tempest logs I don't see the test actually doing an ssh before attaching the volume:

2023-04-24 16:38:14,298 89987 INFO [tempest.lib.common.rest_client] Request (ServerRescueNegativeTestJSON:test_rescued_vm_detach_volume): 200 GET https://158.69.68.93/compute/v2.1/servers/02d29a98-2867-4691-a3aa-55a38e031f30 0.456s
2023-04-24 16:38:14,298 89987 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-Auth-Token': '<omitted>'}
        Body: None
    Response - Headers: {'date': 'Mon, 24 Apr 2023 16:38:13 GMT', 'server': 'Apache/2.4.52 (Ubuntu)', 'content-length': '1516', 'content-type': 'application/json', 'openstack-api-version': 'compute 2.1', 'x-openstack-nova-api-version': '2.1', 'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', 'x-openstack-request-id': 'req-513b0c16-17b8-4f22-b35c-424ec0574285', 'x-compute-request-id': 'req-513b0c16-17b8-4f22-b35c-424ec0574285', 'connection': 'close', 'status': '200', 'content-location': 'https://158.69.68.93/compute/v2.1/servers/02d29a98-2867-4691-a3aa-55a38e031f30'}
        Body: b'{"server": {"id": "02d29a98-2867-4691-a3aa-55a38e031f30", "name": "tempest-ServerRescueNegativeTestJSON-server-1181505723", "status": "ACTIVE", "tenant_id": "6634ce1b2fb8494fbf87dbeaea9eaade", "user_id": "487a73080c73464b9f0317f224d6f1aa", "metadata": {}, "hostId": "53314c3f02ec2ebee6213867385603d34e2ae1378ed9ba4f7677b308", "image": {"id": "b323bba3-371f-4283-b030-60d286e96e7e", "links": [{"rel": "bookmark", "href": "https://158.69.68.93/compute/images/b323bba3-371f-4283-b030-60d286e96e7e"}]}, "flavor": {"id": "42", "links": [{"rel": "bookmark", "href": "https://158.69.68.93/compute/flavors/42"}]}, "created": "2023-04-24T16:38:05Z", "updated": "2023-04-24T16:38:13Z", "addresses": {"tempest-ServerRescueNegativeTestJSON-704888750-network": [{"version": 4, "addr": "10.1.0.12", "OS-EXT-IPS:type": "fixed", "OS-EXT-IPS-MAC:mac_addr": "fa:16:3e:cc:ae:3d"}]}, "accessIPv4": "", "accessIPv6": "", "links": [{"rel": "self", "href": "https://158.69.68.93/compute/v2.1/servers/02d29a98-2867-4691-a3aa-55a38e031f30"}, {"rel": "bookmark", "href": "https://158.69.68.93/compute/servers/02d29a98-2867-4691-a3aa-55a38e031f30"}], "OS-DCF:diskConfig": "MANUAL", "progress": 0, "OS-EXT-AZ:availability_zone": "nova", "config_drive": "", "key_name": null, "OS-SRV-USG:launched_at": "2023-04-24T16:38:12.000000", "OS-SRV-USG:terminated_at": null, "security_groups": [{"name": "default"}], "OS-EXT-STS:task_state": null, "OS-EXT-STS:vm_state": "active", "OS-EXT-STS:power_state": 1, "os-extended-volumes:volumes_attached": []}}'
2023-04-24 16:38:14,305 89987 INFO [tempest.common.waiters] State transition "BUILD/spawning" ==> "ACTIVE/None" after 9 second wait
2023-04-24 16:38:14,601 89987 INFO [tempest.lib.common.rest_client] Request (ServerRescueNegativeTestJSON:test_rescued_vm_detach_volume): 200 POST https://158.6...


Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

So having device_detach_timeout set to 300s did not help either. I'm fairly confident that the failing test cases are not waiting for the guest to be sshable before they attach / detach the volume.

Revision history for this message
Amit Gupta (amikugup) wrote :

Do we have any update regarding this issue? We are also observing a similar issue where libvirt gives up after a certain number of retries, and we saw the following logs:

nce: 91add701-b719-496e-9d37-1332568b4ea0] Detach interface failed, port_id=0dd877b0-5431-4f52-bb63-94727c29048c, reason: Device detach failed for tap0dd877b0-54: Run out of retry while detaching device tap0dd877b0-54 with device alias hostdev6 from instance 91add701-b719-496e-9d37-1332568b4ea0 from the live domain config. Device is still attached to the guest.: nova.exception.DeviceDetachFailed: Device detach failed for tap0dd877b0-54: Run out of retry while detaching device tap0dd877b0-54 with device alias hostdev6 from instance 91add701-b719-496e-9d37-1332568b4ea0 from the live domain config. Device is still attached to the guest.
2023-09-08 03:56:11.353 3559248 DEBUG oslo_concurrency.lockutils [req-572b95b9-e051-4de1-a36d-b6ba2128141a f7de05eb5ec71d4dd8ccb3b765fb4c9eb2206acc00018d852f9ff4086fc92cdc f569a10821d544a7877d3130d9da33b8 - 8a596ebd3f9745f7a2948fd057d43920 07bbc0238c5241dd8c90d27af90a5ef7] Lock "interface-91add701-b719-496e-9d37-1332568b4ea0-0dd877b0-5431-4f52-bb63-94727c29048c" "released" by "nova.compute.manager.ComputeManager.detach_interface.<locals>.do_detach_interface" :: held 200.147s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:405
2023-09-08 03:56:11.449 3559248 ERROR oslo_messaging.rpc.server [req-572b95b9-e051-4de1-a36d-b6ba2128141a f7de05eb5ec71d4dd8ccb3b765fb4c9eb2206acc00018d852f9ff4086fc92cdc f569a10821d544a7877d3130d9da33b8 - 8a596ebd3f9745f7a2948fd057d43920 07bbc0238c5241dd8c90d27af90a5ef7] Exception during message handling: nova.exception.InterfaceDetachFailed: Failed to detach network adapter device from 91add701-b719-496e-9d37-1332568b4ea0
2023-09-08 03:56:11.449 3559248 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-09-08 03:56:11.449 3559248 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 8049, in _detach_interface
2023-09-08 03:56:11.449 3559248 ERROR oslo_messaging.rpc.server self.driver.detach_interface(context, instance, condemned)
2023-09-08 03:56:11.449 3559248 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 2890, in detach_interface
2023-09-08 03:56:11.449 3559248 ERROR oslo_messaging.rpc.server device_name=self.vif_driver.get_vif_devname(vif),
2023-09-08 03:56:11.449 3559248 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 2429, in _detach_with_retry
2023-09-08 03:56:11.449 3559248 ERROR oslo_messaging.rpc.server device_name)
2023-09-08 03:56:11.449 3559248 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 2508, in _detach_from_live_with_retry
2023-09-08 03:56:11.449 3559248 ERROR oslo_messaging.rpc.server reason=msg % (device_name, live_dev.alias, instance_uuid))
2023-09-08 03:56:11.449 3559248 ERROR oslo_messaging.rpc.server nova.exception.DeviceDetachFailed: Device detach failed for tap0dd877b0-54: Run out of retry while detaching ...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Stephen Finucane <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/881389
