test_list_get_volume_attachments failing with 400 error on teardown when detaching an already detached volume

Bug #1722577 reported by Matt Riedemann
This bug affects 2 people
Affects                    Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)   Won't Fix     Undecided   Unassigned      -
tempest                    Fix Released  Medium      Matt Riedemann  -

Bug Description

Noticed this today:

http://logs.openstack.org/04/510704/1/check/gate-tempest-dsvm-py35-ubuntu-xenial/084e7c7/console.html#_2017-10-10_07_50_05_759627

2017-10-10 07:50:05.759627 | tempest.api.compute.volumes.test_attach_volume.AttachVolumeTestJSON.test_list_get_volume_attachments[id-7fa563fe-f0f7-43eb-9e22-a1ece036b513]
2017-10-10 07:50:05.759770 | ---------------------------------------------------------------------------------------------------------------------------------------------
2017-10-10 07:50:05.759801 |
2017-10-10 07:50:05.759859 | Captured traceback:
2017-10-10 07:50:05.759942 | ~~~~~~~~~~~~~~~~~~~
2017-10-10 07:50:05.759999 | b'Traceback (most recent call last):'
2017-10-10 07:50:05.760082 | b' File "/opt/stack/new/tempest/tempest/lib/common/utils/test_utils.py", line 84, in call_and_ignore_notfound_exc'
2017-10-10 07:50:05.760144 | b' return func(*args, **kwargs)'
2017-10-10 07:50:05.760226 | b' File "/opt/stack/new/tempest/tempest/lib/services/compute/servers_client.py", line 441, in detach_volume'
2017-10-10 07:50:05.760266 | b' (server_id, volume_id))'
2017-10-10 07:50:05.760335 | b' File "/opt/stack/new/tempest/tempest/lib/common/rest_client.py", line 301, in delete'
2017-10-10 07:50:05.760411 | b" return self.request('DELETE', url, extra_headers, headers, body)"
2017-10-10 07:50:05.760545 | b' File "/opt/stack/new/tempest/tempest/lib/services/compute/base_compute_client.py", line 48, in request'
2017-10-10 07:50:05.760602 | b' method, url, extra_headers, headers, body, chunked)'
2017-10-10 07:50:05.760670 | b' File "/opt/stack/new/tempest/tempest/lib/common/rest_client.py", line 659, in request'
2017-10-10 07:50:05.760716 | b' self._error_checker(resp, resp_body)'
2017-10-10 07:50:05.760787 | b' File "/opt/stack/new/tempest/tempest/lib/common/rest_client.py", line 770, in _error_checker'
2017-10-10 07:50:05.760852 | b' raise exceptions.BadRequest(resp_body, resp=resp)'
2017-10-10 07:50:05.760903 | b'tempest.lib.exceptions.BadRequest: Bad request'
2017-10-10 07:50:05.761081 | b'Details: {\'message\': "Invalid volume: Invalid input received: Invalid volume: Unable to detach volume. Volume status must be \'in-use\' and attach_status must be \'attached\' to detach. (HTTP 400) (Request-ID: req-e4faf8b8-a431-4021-8480-486c94a91543)", \'code\': 400}'
2017-10-10 07:50:05.761119 | b''

This test does the following:

1. create a server
2. create a volume vol1
3. attach vol1 to the server and wait for the volume status to be in-use
4. repeat steps 2 and 3 for a second volume, vol2
5. for each volume attachment, detach it and wait for the volume status to be 'available'
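
Roughly, the flow looks like the following tempest-style sketch (helper and waiter names are assumed from tempest's compute base classes; this is not the literal test body):

    # Sketch of the test flow described above (assumed helpers, not the
    # literal test code).
    from tempest.api.compute import base
    from tempest.common import waiters

    class AttachVolumeTestJSON(base.BaseV2ComputeTest):

        def test_list_get_volume_attachments(self):
            server = self.create_test_server(wait_until='ACTIVE')
            volumes = []
            for _ in range(2):
                volume = self.create_volume()
                self.servers_client.attach_volume(
                    server['id'], volumeId=volume['id'])
                waiters.wait_for_volume_resource_status(
                    self.volumes_client, volume['id'], 'in-use')
                volumes.append(volume)
            for volume in volumes:
                self.servers_client.detach_volume(server['id'], volume['id'])
                waiters.wait_for_volume_resource_status(
                    self.volumes_client, volume['id'], 'available')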

That all works. The failure happens during the cleanup routines on teardown: they try to detach the volumes, but the volumes are already detached, which results in a 400 response from the compute API (because the compute API itself gets a 400 response from the volumev3 API):

http://logs.openstack.org/04/510704/1/check/gate-tempest-dsvm-py35-ubuntu-xenial/084e7c7/console.html#_2017-10-10_07_50_05_808506

2017-10-10 07:50:05.808506 | b'2017-10-10 07:30:15,928 2243 INFO [tempest.lib.common.rest_client] Request (AttachVolumeTestJSON:_run_cleanups): 400 DELETE https://15.184.65.28/compute/v2.1/servers/342d8d9e-cd9b-43af-8a25-648de6bf546e/os-volume_attachments/c1f8d572-e907-468b-b44a-60c5bee32094 0.372s'
2017-10-10 07:50:05.808588 | b"2017-10-10 07:30:15,928 2243 DEBUG [tempest.lib.common.rest_client] Request - Headers: {'Content-Type': 'application/json', 'X-Auth-Token': '<omitted>', 'Accept': 'application/json'}"
2017-10-10 07:50:05.808610 | b' Body: None'
2017-10-10 07:50:05.808826 | b" Response - Headers: {'content-location': 'https://15.184.65.28/compute/v2.1/servers/342d8d9e-cd9b-43af-8a25-648de6bf546e/os-volume_attachments/c1f8d572-e907-468b-b44a-60c5bee32094', 'server': 'Apache/2.4.18 (Ubuntu)', 'vary': 'OpenStack-API-Version,X-OpenStack-Nova-API-Version', 'x-openstack-nova-api-version': '2.1', 'x-openstack-request-id': 'req-1a6f6f2c-b64b-4507-8b48-f9600bb4a8fd', 'openstack-api-version': 'compute 2.1', 'content-type': 'application/json; charset=UTF-8', 'date': 'Tue, 10 Oct 2017 07:30:15 GMT', 'connection': 'close', 'x-compute-request-id': 'req-1a6f6f2c-b64b-4507-8b48-f9600bb4a8fd', 'content-length': '269', 'status': '400'}"
2017-10-10 07:50:05.808941 | b' Body: b\'{"badRequest": {"code": 400, "message": "Invalid volume: Invalid input received: Invalid volume: Unable to detach volume. Volume status must be \\\'in-use\\\' and attach_status must be \\\'attached\\\' to detach. (HTTP 400) (Request-ID: req-e4faf8b8-a431-4021-8480-486c94a91543)"}}\''

The cleanup code handles 404s but not 400s, so the cleanup routine fails.
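
For reference, the helper used by the cleanup (tempest/lib/common/utils/test_utils.py) only swallows NotFound, roughly like this, so the BadRequest above propagates and fails the teardown:

    # Rough paraphrase of tempest.lib.common.utils.test_utils's
    # call_and_ignore_notfound_exc: only NotFound (404) is ignored, so the
    # BadRequest (400) from the detach call bubbles up.
    from tempest.lib import exceptions as lib_exc

    def call_and_ignore_notfound_exc(func, *args, **kwargs):
        try:
            return func(*args, **kwargs)
        except lib_exc.NotFound:
            pass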

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22_run_cleanups%5C%22%20AND%20message%3A%5C%22400%20DELETE%5C%22%20AND%20message%3A%5C%22compute%2Fv2.1%2Fservers%2F%5C%22%20AND%20message%3A%5C%22os-volume_attachments%2F%5C%22&from=7d

Matt Riedemann (mriedem)
Changed in tempest:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

It should be noted that logstash only shows this happening since Pike, so there might have been a regression in the API behavior, e.g. maybe we were returning a 404 before. If there was a regression, this is likely the cause: https://review.openstack.org/#/c/446671/

Revision history for this message
Matt Riedemann (mriedem) wrote :

Something else could be that the cinder v3 API, which nova started using by default in Pike, is returning a 400 when detaching an already detached volume, rather than just ignoring the request because it's already detached.

Matt Riedemann (mriedem)
Changed in tempest:
status: Triaged → Invalid
status: Invalid → Confirmed
Changed in nova:
status: New → Triaged
Revision history for this message
Matt Riedemann (mriedem) wrote :

I know what changed, and why it doesn't fail 100% of the time: it's a race.

Before Pike, when detaching a volume, the nova compute manager would delete the BlockDeviceMapping for the volume from the nova database *before* telling cinder to detach the volume, which puts it into 'available' status:

a) destroy bdm: https://github.com/openstack/nova/blob/15.0.0/nova/compute/manager.py#L4936

b) mark volume as available: https://github.com/openstack/nova/blob/15.0.0/nova/compute/manager.py#L4941

c) marking the volume 'available' is what makes the os-volume_attachments API return a 400 on a subsequent detach request, but only if the BDM still exists; with the old ordering the BDM was already gone, so the caller got a 404 instead:

https://github.com/openstack/nova/blob/15.0.0/nova/api/openstack/compute/volumes.py#L442

--

In Pike, the code in the compute manager changed such that the BDM is deleted *after* the volume is marked as available:

a) mark volume as available: https://github.com/openstack/nova/blob/16.0.0/nova/compute/manager.py#L4906

b) destroy bdm: https://github.com/openstack/nova/blob/16.0.0/nova/compute/manager.py#L4921

As a result, a request hitting path (c) can now race with the detach: it may find the BDM before it is deleted and attempt to detach the already detached volume, which results in a 400 rather than a 404.
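
To illustrate the ordering change (simplified, illustrative pseudocode; these are not the real nova functions):

    # Simplified pseudocode of the two detach orderings in the compute
    # manager; function and parameter names are illustrative only.

    def detach_volume_before_pike(bdm, volume_id, cinder):
        bdm.destroy()               # BDM removed first ...
        cinder.detach(volume_id)    # ... then the volume becomes 'available'.
        # A concurrent detach request no longer finds a BDM -> 404.

    def detach_volume_since_pike(bdm, volume_id, cinder):
        cinder.detach(volume_id)    # volume becomes 'available' first ...
        bdm.destroy()               # ... BDM removed afterwards.
        # A concurrent detach request can still find the BDM, call Cinder,
        # and get a 400 because the volume is no longer 'in-use'.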

--

Arguably Tempest, as the client, should be checking the volume status before blindly attempting to detach it.
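
A minimal sketch of that "check before detach" approach (assumed helper and client names; the merged Tempest change linked further down is the real version):

    # Cleanup helper sketch: only detach when the volume is still 'in-use'.
    from tempest.common import waiters

    def detach_volume_if_in_use(servers_client, volumes_client,
                                server_id, volume_id):
        volume = volumes_client.show_volume(volume_id)['volume']
        if volume['status'] == 'in-use':
            servers_client.detach_volume(server_id, volume_id)
            waiters.wait_for_volume_resource_status(
                volumes_client, volume_id, 'available')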

Revision history for this message
Matt Riedemann (mriedem) wrote :

The nova change in Pike that introduced the side effect:

https://review.openstack.org/#/c/440693/

Matt Riedemann (mriedem)
Changed in tempest:
assignee: nobody → Matt Riedemann (mriedem)
Changed in nova:
status: Triaged → Won't Fix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/510951

Changed in tempest:
status: Confirmed → In Progress
Changed in tempest:
assignee: Matt Riedemann (mriedem) → Ghanshyam Mann (ghanshyammann)
Matt Riedemann (mriedem)
Changed in tempest:
assignee: Ghanshyam Mann (ghanshyammann) → Matt Riedemann (mriedem)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tempest (master)

Reviewed: https://review.openstack.org/510951
Committed: https://git.openstack.org/cgit/openstack/tempest/commit/?id=0d4551b367c9d3f37117a5426a9ca500d791ac25
Submitter: Jenkins
Branch: master

commit 0d4551b367c9d3f37117a5426a9ca500d791ac25
Author: Matt Riedemann <email address hidden>
Date: Tue Oct 10 13:00:48 2017 -0400

    Only attempt to detach an in-use volume during cleanup

    test_list_get_volume_attachments intermittently fails
    during cleanup because it tries to detach an already
    detached volume, which results in a 400 response.

    Tempest, as the client, should be checking the volume
    status before making the detach request. The only reason
    this ever worked before Pike was because of some
    (incorrect) ordering of operations in the compute
    service which affected how the API behaved during detach,
    and the compute API would return a 404 rather than a 400.

    That changed with I2581ff9f9c0e7cfc14a25acf45eb1860df69eacf
    in Pike, which exposed the race on the Tempest side by
    deleting the BDM in nova *after* marking the volume as
    'available' in Cinder, and the os-volume_attachments API
    checks for the existence of the BDM and if it exists, attempts
    the detach (which then fails with the 400 from Cinder).

    Change-Id: Id2d22cbb86d8d5fa7f71202b274260c1367e8a0f
    Closes-Bug: #1722577

Changed in tempest:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tempest 17.1.0

This issue was fixed in the openstack/tempest 17.1.0 release.
