test_list_get_volume_attachments failing with 400 error on teardown when detaching an already detached volume

Bug #1722577 reported by Matt Riedemann
This bug affects 2 people
Affects                    Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)   Won't Fix     Undecided   Unassigned      -
tempest                    Fix Released  Medium      Matt Riedemann  -

Bug Description

Noticed this today:

http://logs.openstack.org/04/510704/1/check/gate-tempest-dsvm-py35-ubuntu-xenial/084e7c7/console.html#_2017-10-10_07_50_05_759627

2017-10-10 07:50:05.759627 | tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_list_get_volume_attachments[id-7fa563fe-f0f7-43eb-9e22-a1ece036b513]
2017-10-10 07:50:05.759770 | ---------------------------------------------------------------------------------------------------------------------------------------------
2017-10-10 07:50:05.759801 |
2017-10-10 07:50:05.759859 | Captured traceback:
2017-10-10 07:50:05.759942 | ~~~~~~~~~~~~~~~~~~~
2017-10-10 07:50:05.759999 | b'Traceback (most recent call last):'
2017-10-10 07:50:05.760082 | b' File "/opt/stack/new/tempest/tempest/lib/common/utils/test_utils.py", line 84, in call_and_ignore_notfound_exc'
2017-10-10 07:50:05.760144 | b' return func(*args, **kwargs)'
2017-10-10 07:50:05.760226 | b' File "/opt/stack/new/tempest/tempest/lib/services/compute/servers_client.py", line 441, in detach_volume'
2017-10-10 07:50:05.760266 | b' (server_id, volume_id))'
2017-10-10 07:50:05.760335 | b' File "/opt/stack/new/tempest/tempest/lib/common/rest_client.py", line 301, in delete'
2017-10-10 07:50:05.760411 | b" return self.request('DELETE', url, extra_headers, headers, body)"
2017-10-10 07:50:05.760545 | b' File "/opt/stack/new/tempest/tempest/lib/services/compute/base_compute_client.py", line 48, in request'
2017-10-10 07:50:05.760602 | b' method, url, extra_headers, headers, body, chunked)'
2017-10-10 07:50:05.760670 | b' File "/opt/stack/new/tempest/tempest/lib/common/rest_client.py", line 659, in request'
2017-10-10 07:50:05.760716 | b' self._error_checker(resp, resp_body)'
2017-10-10 07:50:05.760787 | b' File "/opt/stack/new/tempest/tempest/lib/common/rest_client.py", line 770, in _error_checker'
2017-10-10 07:50:05.760852 | b' raise exceptions.BadRequest(resp_body, resp=resp)'
2017-10-10 07:50:05.760903 | b'tempest.lib.exceptions.BadRequest: Bad request'
2017-10-10 07:50:05.761081 | b'Details: {\'message\': "Invalid volume: Invalid input received: Invalid volume: Unable to detach volume. Volume status must be \'in-use\' and attach_status must be \'attached\' to detach. (HTTP 400) (Request-ID: req-e4faf8b8-a431-4021-8480-486c94a91543)", \'code\': 400}'
2017-10-10 07:50:05.761119 | b''

This test does the following:

1. create a server
2. create a volume vol1
3. attach vol1 to the server and wait for the volume status to be in-use
4. repeat steps 2 and 3 for a second volume, vol2
5. for each volume attachment, detach it and wait for the volume status to be 'available'
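
Roughly, the flow looks like the following tempest-style sketch (helper and waiter names are assumed from tempest's compute base classes; this is not the literal test body):

    # Sketch of the test flow described above (assumed helpers, not the
    # literal test code).
    from tempest.api.compute import base
    from tempest.common import waiters

    class AttachVolumeTestJSON(base.BaseV2ComputeTest):

        def test_list_get_volume_attachments(self):
            server = self.create_test_server(wait_until='ACTIVE')
            volumes = []
            for _ in range(2):
                volume = self.create_volume()
                self.servers_client.attach_volume(
                    server['id'], volumeId=volume['id'])
                waiters.wait_for_volume_resource_status(
                    self.volumes_client, volume['id'], 'in-use')
                volumes.append(volume)
            for volume in volumes:
                self.servers_client.detach_volume(server['id'], volume['id'])
                waiters.wait_for_volume_resource_status(
                    self.volumes_client, volume['id'], 'available')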

That all works. The failure happens during the cleanup routines on teardown: they try to detach the volumes, but the volumes are already detached, which results in a 400 response from the compute API (because the compute API itself gets a 400 response from the volumev3 API):

http://logs.openstack.org/04/510704/1/check/gate-tempest-dsvm-py35-ubuntu-xenial/084e7c7/console.html#_2017-10-10_07_50_05_808506

2017-10-10 07:50:05.808506 | b'2017-10-10 07:30:15,928 2243 INFO [tempest.lib.common.rest_client] Request (AttachVolumeTestJSON:_run_cleanups): 400 DELETE https://15.184.65.28/compute/v2.1/servers/342d8d9e-cd9b-43af-8a25-648de6bf546e/os-volume_attachments/c1f8d572-e907-468b-b44a-60c5bee32094 0.372s'
2017-10-10 07:50:05.808588 | b"2017-10-10 07:30:15,928 2243 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'X-Auth-Token': '<omitted>', 'Accept': 'application/json'}"
2017-10-10 07:50:05.808610 | b' Body: None'
2017-10-10 07:50:05.808826 | b" Response - Headers: {'content-location': 'https://15.184.65.28/compute/v2.1/servers/342d8d9e-cd9b-43af-8a25-648de6bf546e/os-volume_attachments/c1f8d572-e907-468b-b44a-60c5bee32094', 'server': 'Apache/2.4.18 (Ubuntu)', 'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', 'x-openstack-nova-api-version': '2.1', 'x-openstack-request-id': 'req-1a6f6f2c-b64b-4507-8b48-f9600bb4a8fd', 'openstack-api-version': 'compute 2.1', 'content-type': 'application/json; charset=UTF-8', 'date': 'Tue, 10 Oct 2017 07:30:15 GMT', 'connection': 'close', 'x-compute-request-id': 'req-1a6f6f2c-b64b-4507-8b48-f9600bb4a8fd', 'content-length': '269', 'status': '400'}"
2017-10-10 07:50:05.808941 | b' Body: b\'{"badRequest": {"code": 400, "message": "Invalid volume: Invalid input received: Invalid volume: Unable to detach volume. Volume status must be \\\'in-use\\\' and attach_status must be \\\'attached\\\' to detach. (HTTP 400) (Request-ID: req-e4faf8b8-a431-4021-8480-486c94a91543)"}}\''

The cleanup code handles 404s but not 400s, so the cleanup routine fails.
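
For reference, the helper used by the cleanup (tempest/lib/common/utils/test_utils.py) only swallows NotFound, roughly like this, so the BadRequest above propagates and fails the teardown:

    # Rough paraphrase of tempest.lib.common.utils.test_utils's
    # call_and_ignore_notfound_exc: only NotFound (404) is ignored, so the
    # BadRequest (400) from the detach call bubbles up.
    from tempest.lib import exceptions as lib_exc

    def call_and_ignore_notfound_exc(func, *args, **kwargs):
        try:
            return func(*args, **kwargs)
        except lib_exc.NotFound:
            pass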

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22_run_cleanups%5C%22%20AND%20message%3A%5C%22400%20DELETE%5C%22%20AND%20message%3A%5C%22compute%2Fv2.1%2Fservers%2F%5C%22%20AND%20message%3A%5C%22os-volume_attachments%2F%5C%22&from=7d

Matt Riedemann (mriedem)
Changed in tempest:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

It should be noted that logstash only shows this happening since Pike, so there might have been a regression in the API behavior, e.g. maybe we were returning a 404 before. If there was a regression, this is likely the cause: https://review.openstack.org/#/c/446671/

Revision history for this message
Matt Riedemann (mriedem) wrote :

Something else could be that the cinder v3 API, which nova started using by default in Pike, is returning a 400 when detaching an already detached volume, rather than just ignoring the request because it's already detached.

Matt Riedemann (mriedem)
Changed in tempest:
status: Triaged → Invalid
status: Invalid → Confirmed
Changed in nova:
status: New → Triaged
Revision history for this message
Matt Riedemann (mriedem) wrote :

I know what changed, and why it doesn't fail 100% of the time: it's a race.

Before Pike, when detaching a volume, the nova compute manager would delete the BlockDeviceMapping for the volume from the nova database *before* telling cinder to detach the volume, which puts it into 'available' status:

a) destroy bdm: https://github.com/openstack/nova/blob/15.0.0/nova/compute/manager.py#L4936

b) mark volume as available: https://github.com/openstack/nova/blob/15.0.0/nova/compute/manager.py#L4941

c) marking the volume 'available' is what makes the os-volume_attachments API return a 400 on a subsequent detach request, but only if the BDM still exists; with the old ordering the BDM was already gone, so the caller got a 404 instead:

https://github.com/openstack/nova/blob/15.0.0/nova/api/openstack/compute/volumes.py#L442

--

In Pike, the code in the compute manager changed such that the BDM is deleted *after* the volume is marked as available:

a) mark volume as available: https://github.com/openstack/nova/blob/16.0.0/nova/compute/manager.py#L4906

b) destroy bdm: https://github.com/openstack/nova/blob/16.0.0/nova/compute/manager.py#L4921

As a result, a request hitting path (c) can now race with the detach: it may find the BDM before it is deleted and attempt to detach the already detached volume, which results in a 400 rather than a 404.
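
To illustrate the ordering change (simplified, illustrative pseudocode; these are not the real nova functions):

    # Simplified pseudocode of the two detach orderings in the compute
    # manager; function and parameter names are illustrative only.

    def detach_volume_before_pike(bdm, volume_id, cinder):
        bdm.destroy()               # BDM removed first ...
        cinder.detach(volume_id)    # ... then the volume becomes 'available'.
        # A concurrent detach request no longer finds a BDM -> 404.

    def detach_volume_since_pike(bdm, volume_id, cinder):
        cinder.detach(volume_id)    # volume becomes 'available' first ...
        bdm.destroy()               # ... BDM removed afterwards.
        # A concurrent detach request can still find the BDM, call Cinder,
        # and get a 400 because the volume is no longer 'in-use'.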

--

Arguably Tempest, as the client, should be checking the volume status before blindly attempting to detach it.
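
A minimal sketch of that "check before detach" approach (assumed helper and client names; the merged Tempest change linked further down is the real version):

    # Cleanup helper sketch: only detach when the volume is still 'in-use'.
    from tempest.common import waiters

    def detach_volume_if_in_use(servers_client, volumes_client,
                                server_id, volume_id):
        volume = volumes_client.show_volume(volume_id)['volume']
        if volume['status'] == 'in-use':
            servers_client.detach_volume(server_id, volume_id)
            waiters.wait_for_volume_resource_status(
                volumes_client, volume_id, 'available')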

Revision history for this message
Matt Riedemann (mriedem) wrote :

The nova change in Pike that introduced the side effect:

https://review.openstack.org/#/c/440693/

Matt Riedemann (mriedem)
Changed in tempest:
assignee: nobody → Matt Riedemann (mriedem)
Changed in nova:
status: Triaged → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/510951

Changed in tempest:
status: Confirmed → In Progress
Changed in tempest:
assignee: Matt Riedemann (mriedem) → Ghanshyam Mann (ghanshyammann)
Matt Riedemann (mriedem)
Changed in tempest:
assignee: Ghanshyam Mann (ghanshyammann) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/510951
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=0d4551b367c9d3f37117a5426a9ca500d791ac25
Submitter: Jenkins
Branch: master

commit 0d4551b367c9d3f37117a5426a9ca500d791ac25
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 10 13:00:48 2017 -0400

    Only attempt to detach an in-use volume during cleanup

    test_list_get_volume_attachments intermittently fails
    during cleanup because it tries to detach an already
    detached volume, which results in a 400 response.

    Tempest, as the client, should be checking the volume
    status before making the detach request. The only reason
    this ever worked before Pike was because of some
    (incorrect) ordering of operations in the compute
    service which affected how the API behaved during detach,
    and the compute API would return a 404 rather than a 400.

    That changed with I2581ff9f9c0e7cfc14a25acf45eb1860df69eacf
    in Pike, which exposed the race on the Tempest side by
    deleting the BDM in nova *after* marking the volume as
    'available' in Cinder, and the os-volume_attachments API
    checks for the existence of the BDM and if it exists, attempts
    the detach (which then fails with the 400 from Cinder).

    Change-Id: Id2d22cbb86d8d5fa7f71202b274260c1367e8a0f
    Closes-Bug: #1722577

Changed in tempest:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tempest 17.1.0

This issue was fixed in the openstack/tempest 17.1.0 release.
