MultiAttachVolumeSwap fails or takes a long time to detach volume

Bug #1989232 reported by Aboubacar Diare
This bug affects 1 person
Affects                    Status    Importance  Assigned to  Milestone
OpenStack Compute (nova)   Triaged   Medium      Unassigned
os-brick                   New       Undecided   Unassigned
tempest                    New       Undecided   Unassigned

Bug Description

ERROR: tempest.api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach fails during tempest iSCSI tests because the volume takes a long time to detach, or fails to detach, from the instance. The logs herein show an example of a failure to detach.

EXPECTED BEHAVIOR: Volume successfully detaches and the test passes.

HOW TO DUPLICATE:
Run: tox -e all -- tempest.api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach | tee -a console.log.out

CONFIG:
- DevStack Zed Release
- Single node using iSCSI
- Host OS: Ubuntu 20.04
Distributor ID: Ubuntu
Description: Ubuntu 20.04.3 LTS
Release: 20.04
Codename: focal

From tempest console.log:

tempest.api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach[id-e8f8f9d1-d7b7-4cd2-8213-ab85ef697b6e,slow,volume]
-----------------------------------------------------------------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):

      File "/opt/stack/tempest/tempest/lib/decorators.py", line 81, in wrapper
    return f(*func_args, **func_kwargs)

      File "/opt/stack/tempest/tempest/common/utils/__init__.py", line 70, in wrapper
    return f(*func_args, **func_kwargs)

      File "/opt/stack/tempest/tempest/api/compute/admin/test_volume_swap.py", line 245, in test_volume_swap_
with_multiattach
    waiters.wait_for_volume_resource_status(self.volumes_client,

      File "/opt/stack/tempest/tempest/common/waiters.py", line 301, in wait_for_volume_resource_status
    time.sleep(client.build_interval)

      File "/opt/stack/tempest/.tox/tempest/lib/python3.8/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
    raise TimeoutException()

    fixtures._fixtures.timeout.TimeoutException

Captured traceback-1:
~~~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):

      File "/opt/stack/tempest/tempest/common/waiters.py", line 385, in wait_for_volume_attachment_remove_from_server
    raise lib_exc.TimeoutException(message)

    tempest.lib.exceptions.TimeoutException: Request timed out
Details: Volume a54c67b7-786e-4ba7-94ea-d1e0a722424a failed to detach from server 986b2dd5-542a-4344-a929-9ac7bbf35d7c within the required time (3600 s) from the compute API perspective

In waiters.py:

    373     while any(volume for volume in volumes if volume['volumeId'] == volume_id):
    374         time.sleep(client.build_interval)
    375
    376         timed_out = int(time.time()) - start >= client.build_timeout
    377         if timed_out:
    378             console_output = client.get_console_output(server_id)['output']
    379             LOG.debug('Console output for %s\nbody=\n%s',
    380                       server_id, console_output)
    381             message = ('Volume %s failed to detach from server %s within '
    382                        'the required time (%s s) from the compute API '
    383                        'perspective' %
    384                        (volume_id, server_id, client.build_timeout))
    385             raise lib_exc.TimeoutException(message)
    386         try:
    387             volumes = client.list_volume_attachments(
    388                 server_id)['volumeAttachments']
    389         except lib_exc.NotFound:
    390             # Ignore 404s on detach in case the server is deleted or the volume
    391             # is already detached.
    392             return
    393     return
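
For context, the second captured traceback comes from this waiter. A minimal sketch of how the test ends up calling it is shown below (illustrative only; the exact call site is in tempest/api/compute/admin/test_volume_swap.py, build_interval and build_timeout come from tempest.conf, and the variable names are assumptions):

    from tempest.common import waiters

    # Poll the compute API's attachment list for server1 until volume1 no
    # longer appears in it, sleeping client.build_interval between polls and
    # raising TimeoutException after client.build_timeout seconds (3600 s in
    # this run).
    waiters.wait_for_volume_attachment_remove_from_server(
        self.servers_client, server1['id'], volume1['id'])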

Tags: volumes
Revision history for this message
Aboubacar Diare (adiare) wrote :
description: updated
Revision history for this message
Aboubacar Diare (adiare) wrote :

Adding Tempest.log

Revision history for this message
Aboubacar Diare (adiare) wrote :

Adding Cinder Volume Log

Revision history for this message
Aboubacar Diare (adiare) wrote :

Adding Cinder API log.

Revision history for this message
Aboubacar Diare (adiare) wrote (last edit ):

Some additional details from the log investigation:

The test that fails is expected to perform the following operations.
1- Create a Volume 1.
2- Create a Volume 2.
3- Create and boot 2 instances Server 1 and Server 2.
4- Attach Volume 1 to Server 1.
5- Attach Volume 1 to Server 2.
6- Swap Volume 1 with Volume 2 on Server 1.
7- Check that Volume 1 is attached to Server 2.
8- Check that Volume 1 is NOT attached to Server 1.
9- Check that Volume 2 is attached to Server 1.
10- Clean up and Exit
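
Expressed as code, that sequence looks roughly like the following (a simplified sketch, not the actual tempest source; the client method names follow the tempest service clients, the swap uses the admin compute API at microversion 2.60 as seen in the request headers below, and the variable names are assumptions):

    # Rough outline of the test flow; names are illustrative.
    volume1 = self.create_volume(multiattach=True)              # step 1
    volume2 = self.create_volume(multiattach=True)              # step 2
    server1 = self.create_test_server(wait_until='ACTIVE')      # step 3
    server2 = self.create_test_server(wait_until='ACTIVE')
    self.attach_volume(server1, volume1)                        # step 4
    self.attach_volume(server2, volume1)                        # step 5
    # Step 6: swap volume1 for volume2 on server 1
    # (PUT .../servers/<server1>/os-volume_attachments/<volume1>)
    self.admin_servers_client.update_attached_volume(
        server1['id'], volume1['id'], volumeId=volume2['id'])
    waiters.wait_for_volume_resource_status(
        self.volumes_client, volume1['id'], 'in-use')
    # Steps 7-9: volume1 should now be attached only to server 2 and
    # volume2 attached to server 1.
    waiters.wait_for_volume_attachment_remove_from_server(
        self.servers_client, server1['id'], volume1['id'])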

From the details below, it is not clear what is causing the long detach reported by tempest.

In the tempest.log file you will find the following (see the tempest.log clipping attached):

- Volume 1 is successfully created (ID: a54c67b7-786e-4ba7-94ea-d1e0a722424a)
    - However, it takes close to 5 minutes to create Volume 1.
- Volume 2 is successfully created (ID: 9c237d89-284d-4608-8585-97f5be2fdb87)
    - Volume 2 is created in 15 seconds.
- Two instances (servers) 1 and 2 are successfully created (Server 1 ID: 43ba7e9c-8923-4587-bbd8-c91cf6ea6458 and Server 2 ID: 986b2dd5-542a-4344-a929-9ac7bbf35d7c)
- Volume 1 is successfully attached to Server 1 – Attachment ID: 02b20af4-af46-4f9e-840b-4664ebc093e0
- Volume 1 is successfully attached to Server 2 – Attachment ID: d5a7a2fa-7d22-4827-8c9b-e068727a927c
- Swap Volume 1 with Volume 2 on Server 1

2022-09-09 15:03:59.542 2703951 INFO tempest.lib.common.rest_client [req-75469d39-ab9e-48d6-b727-41448acc54d4 req-75469d39-ab9e-48d6-b727-41448acc54d4 ] Request (TestMultiAttachVolumeSwap:test_volume_swap_with_multiattach): 202 PUT http://15.113.218.46/compute/v2.1/servers/43ba7e9c-8923-4587-bbd8-c91cf6ea6458/os-volume_attachments/a54c67b7-786e-4ba7-94ea-d1e0a722424a 0.563s
2022-09-09 15:03:59.543 2703951 DEBUG tempest.lib.common.rest_client [req-75469d39-ab9e-48d6-b727-41448acc54d4 req-75469d39-ab9e-48d6-b727-41448acc54d4 ] Request - Headers: {'Content-Type': 'application/json', 'Accept': 'application/json', 'X-OpenStack-Nova-API-Version': '2.60', 'X-Auth-Token': '<omitted>'}
        Body: {"volumeAttachment": {"volumeId": "9c237d89-284d-4608-8585-97f5be2fdb87"}}
    Response - Headers: {'date': 'Fri, 09 Sep 2022 22:03:58 GMT', 'server': 'Apache/2.4.41 (Ubuntu)', 'content-length': '0', 'content-type': 'application/json', 'openstack-api-version': 'compute 2.60', 'x-openstack-nova-api-version': '2.60', 'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', 'x-openstack-request-id': 'req-75469d39-ab9e-48d6-b727-41448acc54d4', 'x-compute-request-id': 'req-75469d39-ab9e-48d6-b727-41448acc54d4', 'connection': 'close', 'status': '202', 'content-location': 'http://15.113.218.46/compute/v2.1/servers/43ba7e9c-8923-4587-bbd8-c91cf6ea6458/os-volume_attachments/a54c67b7-786e-4ba7-94ea-d1e0a722424a'}
        Body: b'' _log_request_full /opt/stack/tempest/tempest/lib/common/rest_client.py:455

- Check Volume 1 attachments:
    - Volume 1 is attached to Server 1 – Status: Detaching
    - Volume 1 is attached to Server 2

2022-09-09 15:03:59.614 2703951 INFO tempest.lib.common.rest_client [req-a6d4d438-52cc-4d7c-8a19-cd700b230357 req-a6d4d438-52cc-4d7c-8a19-cd700b230357 ] Request (TestMultiAttachVol...

Revision history for this message
Aboubacar Diare (adiare) wrote (last edit ):

I wanted to share additional insights for this bug. It appears there is a race condition in Nova that causes Nova to tear down the iSCSI connection to the volume before the volume swap happens.

Below is the analysis.

The tempest.log shows the following (filtering only on the tempest.common.waiters entries), which provides a high-level but effective summary of the operations as they happen.

1- Create Volume 1
2022-09-30 02:47:31.421 998746 INFO tempest.common.waiters [-] volume 34a66f7b-9580-4b83-b853-dbca4b43c13e reached available after waiting for 15.421489 seconds
2- Create Volume 2
2022-09-30 02:47:47.021 998746 INFO tempest.common.waiters [-] volume cd473ad8-afb4-4373-abbd-42845cdefed4 reached available after waiting for 16.021275 seconds
3- Create and boot instance 1 and instance 2 (though this entry looks more like the wait for the instances to report ready)
2022-09-30 02:48:49.301 998746 INFO tempest.common.waiters [-] State transition "BUILD/scheduling" ==> "ACTIVE/None" after 61 second wait
4- Attach Volume 1 to instance 1
2022-09-30 02:49:05.578 998746 INFO tempest.common.waiters [-] Attachment b0e48a10-a2b7-49f8-9226-db8bea7dc901 created for volume 34a66f7b-9580-4b83-b853-dbca4b43c13e to server 0d711b7b-4693-4a7e-9a94-ca4186b4a670 after waiting for 15.578472 seconds
5- Attach Volume 1 to instance 2
2022-09-30 02:49:21.178 998746 INFO tempest.common.waiters [-] Attachment ea690a72-b43f-4155-b882-704c4d780d0e created for volume 34a66f7b-9580-4b83-b853-dbca4b43c13e to server b87546de-0ad9-4c78-ab53-bd2245c40fb8 after waiting for 16.178497 seconds
6- Swap Volumes
2022-09-30 02:51:52.798 998746 INFO tempest.common.waiters [-] volume 34a66f7b-9580-4b83-b853-dbca4b43c13e reached in-use after waiting for 151.798836 seconds

The Nova logs show:

Sep 30 02:47:50 Lock "compute_resources" "released" by "nova.compute.resource_tracker.ResourceTracker.instance_claim" :: held 0.220s {{(pid=2571444) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:400}}
Sep 30 02:47:50 Lock "compute_resources" "released" by "nova.compute.resource_tracker.ResourceTracker.instance_claim" :: held 0.246s {{(pid=2571444) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:400}}
Sep 30 02:47:50 Lock "/opt/stack/data/nova/instances/0d711b7b-4693-4a7e-9a94-ca4186b4a670/disk.info" "released" by "nova.virt.libvirt.imagebackend.Image.resolve_driver_format.<locals>.write_to_disk_info_file" :: held 0.003s {{(pid=2571444) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:400}}
Sep 30 02:47:50 Lock "/opt/stack/data/nova/instances/b87546de-0ad9-4c78-ab53-bd2245c40fb8/disk.info" "released" by "nova.virt.libvirt.imagebackend.Image.resolve_driver_format.<locals>.write_to_disk_info_file" :: held 0.002s {{(pid=2571444) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:400}}
Sep 30 02:47:50 Lock "bf54a85eb14ee030c394d58e82febb0f522d915a" "released" by "nova.virt.libvirt.imagebackend.Qcow2.create_image.<locals>.create_qcow2_image" :: held 0.241s {{(pid=2571444) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:400}}
Sep 30 02:47:51 Lock "bf54a85eb14ee030c394d58e82febb0f52...

Revision history for this message
Lukas Piwowarski (lukas-piwowarski) wrote :

Hi Aboubacar, thanks for the very well written and detailed bug report! I tried to reproduce the issue but was unsuccessful; I had a problem recreating the environment. You mentioned that you ran the test on devstack.

Could you please provide the local.conf you used for the devstack installation? It would make reproducing the error easier.

Revision history for this message
Aboubacar Diare (adiare) wrote :

@Lukas-piwowarski thank you for looking at this. I have attached the local.conf file. Note that the issue can be intermittent. The two tempest tests I have seen exhibit this issue consistently, though in an intermittent fashion, are:

tempest.api.compute.admin.test_volume_swap.TestVolumeSwap.test_volume_swap
tempest.api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach

Regards,
Aboubacar.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Thanks Aboubacar for the well written bug report.

I agree that we have a race between the disconnect and the swap operation. Both use a lock, but they use different locks, so the two operations can overlap.

Failed case:

Sep 30 02:51:47 Lock "connect_volume" "released" by "os_brick.initiator.connectors.iscsi.ISCSIConnector.disconnect_volume" :: held 0.156s {{(pid=2571444) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:400}}
Sep 30 02:51:55 Lock "0d711b7b-4693-4a7e-9a94-ca4186b4a670" "released" by "nova.compute.manager.ComputeManager.swap_volume.<locals>._do_locked_swap_volume" :: held 153.400s {{(pid=2571444) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:400}}

Successful case:

Sep 29 17:12:21 Lock "4aeaeb5d-295f-4149-9330-a016d9da1730" "released" by "nova.compute.manager.ComputeManager.swap_volume.<locals>._do_locked_swap_volume" :: held 632.783s {{(pid=2571444) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:400}}
Sep 29 17:12:25 Lock "connect_volume" "released" by "os_brick.initiator.connectors.iscsi.ISCSIConnector.disconnect_volume" :: held 0.142s {{(pid=2571444) inner /usr/local/lib/python3.8/dist-packages/oslo_concurrency/lockutils.py:400}}

Changed in nova:
status: New → Triaged
importance: Undecided → Medium
tags: added: volumes
Revision history for this message
Aboubacar Diare (adiare) wrote :

Balazs

Thank you for the triage and response.

Let me know if any additional details are needed from me.

AD.

Revision history for this message
Aboubacar Diare (adiare) wrote :

Hi all.

I was curious about the status of this bug. We still encounter this issue in our testing.

Best,

AD
